pith. machine review for the scientific record

arxiv: 2604.19750 · v1 · submitted 2026-03-14 · 💻 cs.SE · cs.AI · cs.HC

Recognition: no theorem link

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:02 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.HC
keywords GUI code generation · visual feedback · LLM agents · code debugging · multi-agent system · InteractGUI Bench · desktop applications · vision-based debugging
0 comments

The pith

Visual feedback and simulated interactions enable LLM agents to debug GUI code more reliably than text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM agents for code generation often fail on graphical user interfaces because they rely on text-based feedback, which can neither assess visual layout nor trigger event-driven behaviors through simulated clicks and inputs. The paper introduces InteractGUI Bench, a collection of 984 real-world desktop GUI tasks that tests both logic and appearance, along with VF-Coder, a multi-agent system that supplies visual screen perception and direct interaction capabilities. By operating in a human-like loop of seeing the rendered interface and acting on it, the system identifies and corrects issues that text feedback misses. On the benchmark this raises success rates and visual quality scores for models like Gemini-3-Flash. A reader would care because most everyday software involves GUIs, so closing the visual gap could make automated code generation practical for a large class of programs.
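As a rough sketch of that loop, using the three agent roles named in Figures 3 and 12 (Task Planner, GUI Operator, Code Fixer): the class and method names below are illustrative stand-ins, not the paper's implementation, and the agent bodies are placeholders rather than MLLM calls.

```python
# Minimal sketch of a vision-feedback multi-agent debugging loop.
# Role names follow Figures 3 and 12; all behaviors here are placeholders.
from dataclasses import dataclass, field

@dataclass
class Observation:
    """What the GUI Operator reports after looking at and driving the app."""
    screenshot: bytes = b""
    issues: list = field(default_factory=list)

class GUIOperator:
    """Renders the program, inspects the screen, simulates clicks, and reports
    logic or layout issues. Placeholder body: no rendering or model call here."""
    def inspect(self, code: str, instruction: str) -> Observation:
        return Observation(issues=[])

class CodeFixer:
    """Repairs the source given the operator's findings, e.g. via incremental
    SEARCH/REPLACE edits of the kind the Code Fixer prompt describes. Placeholder."""
    def fix(self, code: str, issues: list) -> str:
        return code

class TaskPlanner:
    """Coordinates the other two agents until no issues remain or the budget is spent."""
    def __init__(self, operator: GUIOperator, fixer: CodeFixer, max_rounds: int = 3):
        self.operator, self.fixer, self.max_rounds = operator, fixer, max_rounds

    def run(self, code: str, instruction: str) -> str:
        for _ in range(self.max_rounds):
            report = self.operator.inspect(code, instruction)  # "see" the rendered GUI
            if not report.issues:                              # nothing visibly wrong: stop
                return code
            code = self.fixer.fix(code, report.issues)         # "act": repair and re-check
        return code

if __name__ == "__main__":
    final = TaskPlanner(GUIOperator(), CodeFixer()).run("<generated GUI code>", "Build a settings page")
    print(final)
```

The structural point is that the stopping condition is driven by what the operator can see and trigger on screen, not by compiler or console output.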

Core claim

VF-Coder is a vision-feedback-based multi-agent system that perceives visual information from the rendered GUI interface and directly interacts with program elements to identify potential logic and layout issues, thereby increasing the success rate of Gemini-3-Flash from 21.68% to 28.29% and the visual score from 0.4284 to 0.5584 on InteractGUI Bench.

What carries the argument

VF-Coder, a vision-feedback-based multi-agent system that supplies visual perception of the rendered interface and simulated user interactions to generate corrective feedback for GUI code.

If this is right

  • GUI code debugging improves when agents can see the rendered screen and simulate interactions rather than relying on text outputs alone.
  • Text-based approaches remain limited for event-driven programs and visual attributes that require direct inspection of the interface.
  • InteractGUI Bench enables fine-grained measurement of both interaction logic and visual structure in GUI tasks.
  • Multi-agent setups can use vision to locate and fix functional errors as well as appearance mismatches in one workflow.
  • Base model performance on GUI generation rises when visual feedback is added, as shown by the lift from 21.68% to 28.29% success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visual loop could be tested on web or mobile GUI frameworks to check whether desktop results generalize.
  • Stronger vision-language models would likely widen the observed gap between text-only and visual-feedback methods.
  • The benchmark could serve as a standard testbed for comparing alternative feedback mechanisms such as accessibility trees or pixel-level analysis.
  • Integrating visual feedback early in code generation rather than only at debug time might further reduce the number of iterations required.

Load-bearing premise

Visual perception of the rendered interface combined with simulated interactions is sufficient to detect and correct both logic errors and layout problems that text-based feedback cannot resolve.
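A minimal sketch of the simulated-interaction half of that premise, assuming a PyQt5 app (the benchmark prompts require a main.py entry point and setAccessibleName() naming, per the prompt excerpts later on this page): locate an element by its accessible name, click it programmatically, and check that the event logic responds. The two-page demo app is invented for illustration.

```python
# Simulated-interaction check for event-driven GUI logic (illustrative app).
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # allow headless execution
from PyQt5.QtCore import Qt
from PyQt5.QtTest import QTest
from PyQt5.QtWidgets import (QApplication, QLabel, QPushButton,
                             QStackedWidget, QVBoxLayout, QWidget)

app = QApplication([])
win = QWidget()
pages = QStackedWidget()
home, settings = QLabel("Home"), QLabel("Settings")
pages.addWidget(home)
pages.addWidget(settings)

go = QPushButton("Open settings")
go.setAccessibleName("Settings_Button")        # naming convention from the benchmark prompts
go.clicked.connect(lambda: pages.setCurrentWidget(settings))

layout = QVBoxLayout(win)
layout.addWidget(pages)
layout.addWidget(go)
win.show()

# Drive the app the way a user would: find the element by accessible name,
# click it, and verify that the navigation logic actually fired.
target = next(b for b in win.findChildren(QPushButton)
              if b.accessibleName() == "Settings_Button")
QTest.mouseClick(target, Qt.LeftButton)
assert pages.currentWidget() is settings, "navigation logic failed"
print("interaction check passed: the click reached the Settings page")
```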

What would settle it

A GUI task whose errors depend on non-visible internal state such as network responses or hidden variables, where VF-Coder produces no gain over text-only debugging.
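As a concrete (invented) instance of such a task: the window below renders correctly and its button responds to a simulated click, yet the defect sits in a hidden constant, a wrong endpoint path, that neither a screenshot nor a click trace exposes. The URL and widget names are hypothetical.

```python
# A bug that lives in hidden state: visually fine, clickable, still wrong.
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # headless execution
from PyQt5.QtCore import Qt
from PyQt5.QtTest import QTest
from PyQt5.QtWidgets import QApplication, QPushButton, QVBoxLayout, QWidget

API_URL = "https://example.invalid/v2/export"  # bug: the (hypothetical) server only serves /v1/

def export_report():
    # The request would fail at runtime, but the window renders perfectly and
    # the handler fires, so purely visual feedback has nothing to flag.
    print(f"POST {API_URL} (would fail: endpoint does not exist)")

app = QApplication([])
win = QWidget()
layout = QVBoxLayout(win)
button = QPushButton("Export report")
button.setAccessibleName("Export_Button")
button.clicked.connect(export_report)
layout.addWidget(button)
win.show()

QTest.mouseClick(button, Qt.LeftButton)  # the click "works"; the failure is invisible on screen
```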

Figures

Figures reproduced from arXiv: 2604.19750 by Lixin Duan, Ruizhi Zhang, Ting Xie, Wen Li, Ye Huang, Zhilin Liu.

Figure 1. Comparison between (a, b) traditional text-output-based debugging procedures and (c) our vision-perception-based debugging procedure (VF-Coder), which enables direct perception of rendered GUIs for interactive correction.
Figure 2. The data collection and automatic-testing pipeline. (a) Collecting real-world desktop UI screenshots from Flathub, constructing task instructions, and building the IES. (b) The workflow for testing generated GUI apps using a series of steps defined by the IES.
Figure 3. Overall architecture (left) and interaction details (right) of VF-Coder. The right shows a real case: the GUI Operator detects a rendering error on the download page through visual interaction, and under the coordination of the Task Planner, the Code Fixer identifies and repairs the issue.
Figure 4. Visual comparison of Cursor CLI, Gemini CLI, and VF-Coder on a real case.
Figure 5. Task category composition of InteractGUI Bench.
Figure 6. Visual comparison of rendered interfaces by Cursor CLI, Gemini CLI, and VF-Coder on representative InteractGUI Bench tasks. The Visual Score is computed by the Visual Evaluation Model; VF-Coder achieves the highest visual fidelity to the real GUI screenshots and the highest visual score.
Figure 7. MAE curves on the training and validation sets during training of the Visual Evaluation Model. The 18th epoch achieved the best validation MAE (0.0964) and is selected as the final model.
Figure 8. Qualitative visualization of Cases a-d ranked by Visual Evaluation Model scores (high to moderate). Scores correlate well with perceived visual similarity to real screenshots.
Figure 9. Qualitative visualization of Cases e-g ranked by Visual Evaluation Model scores (moderate to near-zero). Near-zero scores correctly indicate either significant rendering errors or fundamental differences in application layout style.
Figure 10. InteractGUI Bench task showcase (example instruction: develop a desktop application for Visual Studio Code following the provided design, with 'Page_VSCode_Main' as the primary entry interface and a standard menu bar including 'File', 'Edit', 'Selection', 'View', ...).
Figure 11. InteractGUI Bench task showcase.
Figure 12. InteractGUI Bench task showcase. Prompts for VF-Coder's three agents (Task Planner, GUI Operator, Code Fixer) are detailed in Figures 14, 15, and 16.
Figure 13. The prompt used for base models (e.g., GPT-5.2, Claude-4.5-Sonnet) to generate GUI apps on InteractGUI Bench.
Figure 14. The prompt used for VF-Coder's Task Planner.
Figure 15. The prompt used for VF-Coder's GUI Operator.
Figure 16. The prompt used for VF-Coder's Code Fixer.
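The Visual Score in Figures 6-9 is produced by a learned Visual Evaluation Model trained to a validation MAE of 0.0964 (Figure 7). As a rough intuition for what a screenshot-to-reference similarity score measures, here is a naive pixel-level proxy; it is not the paper's model, and the file names are illustrative.

```python
# Naive stand-in for a visual score: normalized grayscale pixel difference
# between the rendered screenshot and the reference UI screenshot.
from PIL import Image, ImageChops

def crude_visual_score(rendered_path: str, reference_path: str, size=(512, 512)) -> float:
    rendered = Image.open(rendered_path).convert("L").resize(size)
    reference = Image.open(reference_path).convert("L").resize(size)
    diff = ImageChops.difference(rendered, reference)          # per-pixel absolute difference
    mean_abs_error = sum(diff.getdata()) / (size[0] * size[1] * 255.0)
    return 1.0 - mean_abs_error                                # 1.0 = identical, 0.0 = maximally different

# Example (hypothetical file names):
# score = crude_visual_score("rendered.png", "Page_Dashboard_Home.png")
```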
read the original abstract

Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g., command-line outputs) for multi-round debugging and struggle with graphical user interfaces (GUIs) that involve visual information. This is mainly due to two limitations: 1) GUI programs are event-driven, yet existing methods cannot simulate user interactions to trigger GUI element logic; 2) GUI programs possess visual attributes, making it difficult for text-based approaches to assess whether the rendered interface meets user needs. To systematically address these challenges, we first introduce InteractGUI Bench, a novel benchmark comprising 984 commonly used real-world desktop GUI application tasks designed for fine-grained evaluation of both interaction logic and visual structure. Furthermore, we propose VF-Coder, a vision-feedback-based multi-agent system for debugging GUI code. By perceiving visual information and directly interacting with program interfaces, VF-Coder can identify potential logic and layout issues in a human-like manner. On InteractGUI Bench, our VF-Coder approach increases the success rate of Gemini-3-Flash from 21.68% to 28.29% and raises the visual score from 0.4284 to 0.5584, indicating the effectiveness of visual feedback in GUI debugging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces InteractGUI Bench, a benchmark of 984 real-world desktop GUI tasks for fine-grained evaluation of interaction logic and visual structure, and proposes VF-Coder, a vision-feedback-based multi-agent system that perceives screenshots and simulates interactions to debug GUI code. It claims that VF-Coder raises Gemini-3-Flash success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on the benchmark.

Significance. If the gains can be attributed specifically to visual feedback rather than the multi-agent scaffolding, the work would offer a practical method for overcoming text-only limitations in GUI code generation and debugging, particularly for layout and event-driven issues.

major comments (1)
  1. [Experiments] Experiments section: the headline improvements are attributed to visual perception plus interaction, yet no ablation is reported that holds the multi-agent loop, planning, and interaction simulation fixed while replacing screenshots with textual UI descriptions. This control is required to isolate the contribution of vision from the overall architecture.
minor comments (1)
  1. [Abstract] Abstract: concrete numerical results are stated without any description of the experimental protocol, baseline implementations, number of runs, statistical tests, or controls for confounds such as prompt variations.
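For scale on the minor point: with 984 tasks and a single run per task, a simple bootstrap interval over per-task outcomes would indicate how much of the reported 6.6-point gain could be sampling noise. The per-task outcomes below are synthetic, constructed only to match the paper's headline rate; nothing here reproduces the actual evaluation.

```python
# Percentile bootstrap over synthetic 0/1 task outcomes at the reported rate.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

n_tasks = 984
n_success = round(0.2168 * n_tasks)               # 21.68% headline rate from the abstract
baseline = [1] * n_success + [0] * (n_tasks - n_success)
print(bootstrap_ci(baseline))                     # roughly a +/- 2.6-point interval around 21.7%
```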

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We agree that isolating the contribution of visual feedback is important for strengthening the claims and will revise the manuscript to address this.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline improvements are attributed to visual perception plus interaction, yet no ablation is reported that holds the multi-agent loop, planning, and interaction simulation fixed while replacing screenshots with textual UI descriptions. This control is required to isolate the contribution of vision from the overall architecture.

    Authors: We agree that this ablation is necessary to isolate the specific contribution of direct visual perception. In the revised manuscript, we will add a new control experiment that keeps the multi-agent loop, planning, and interaction simulation fixed while replacing screenshot inputs with textual UI descriptions (e.g., generated via accessibility trees or LLM-based captioning of the same interfaces). We will report success rates and visual scores for this text-only variant on InteractGUI Bench and compare them directly to the full VF-Coder results. This will clarify the incremental benefit of vision over the scaffolding alone. revision: yes
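A sketch of the kind of text-only control this response describes, assuming a PyQt5 app: serialize the widget tree (class, accessible name, text, geometry) into a textual UI description that could replace the screenshot while the rest of the loop stays fixed. The demo window and names are illustrative, not part of the paper.

```python
# Textual UI description in the spirit of an accessibility-tree dump.
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # headless execution
from PyQt5.QtWidgets import QApplication, QLabel, QPushButton, QVBoxLayout, QWidget

def describe(widget, depth=0):
    """Recursively serialize class, accessible name, text, and geometry."""
    text = widget.text() if hasattr(widget, "text") else ""
    lines = ["  " * depth + f"{type(widget).__name__} "
             f"name={widget.accessibleName()!r} text={text!r} "
             f"geom={widget.geometry().getRect()}"]
    for child in widget.children():
        if isinstance(child, QWidget):
            lines += describe(child, depth + 1)
    return lines

app = QApplication([])
win = QWidget()
layout = QVBoxLayout(win)
label = QLabel("Settings")
button = QPushButton("Download")
button.setAccessibleName("Download_Button")
layout.addWidget(label)
layout.addWidget(button)
win.show()

print("\n".join(describe(win)))   # this text would stand in for the screenshot
```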

Circularity Check

0 steps flagged

Empirical benchmark results contain no definitional or self-referential circularity

full rationale

The paper introduces InteractGUI Bench as a new collection of 984 tasks and VF-Coder as a multi-agent architecture that uses screenshots plus simulated interactions. Reported gains (Gemini-3-Flash success 21.68% → 28.29%, visual score 0.4284 → 0.5584) are direct head-to-head measurements on the same fixed task set. No equations, fitted parameters, or self-citations are invoked to derive these numbers; the quantities are externally observable success and visual-match metrics. The derivation chain is therefore self-contained and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5546 in / 966 out tokens · 49923 ms · 2026-05-15T12:02:45.033197+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 5 internal anchors

  1. [1] Acharya, M., Zhang, Y., Leach, K., Huang, Y.: Optimizing code runtime performance through context-aware retrieval-augmented generation. CoRR abs/2501.16692 (2025). https://doi.org/10.48550/arXiv.2501.16692
  2. [2] Anthropic: Claude Sonnet 4.5. https://www.anthropic.com/claude/sonnet (2025)
  3. [3] Antoniades, A., Örwall, A., Zhang, K., Xie, Y., Goyal, A., Wang, W.Y.: SWE-Search: Enhancing software agents with Monte Carlo tree search and iterative refinement. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=G7sIFXugTX
  4. [4] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al.: Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)
  5. [5] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
  6. [6] Google DeepMind: Gemini 3 Flash: Best for frontier intelligence at speed. https://deepmind.google/models/gemini/flash/ (2025)
  7. [7] Google DeepMind: Gemini 3.1 Pro: Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/ (2026)
  8. [8] Gupta, T., Weihs, L., Kembhavi, A.: CodeNav: Beyond tool-use to using real-world codebases with LLM agents. arXiv preprint arXiv:2406.12276 (2024)
  9. [9] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
  10. [10] Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al.: Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938 (2021)
  11. [11] Hu, Y., Zhou, Q., Chen, Q., Li, X., Liu, L., Zhang, D., Kachroo, A., Oz, T., Tripp, O.: QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks. arXiv preprint arXiv:2501.17167 (2025)
  12. [12] Jain, N., Han, K., Gu, A., Li, W.D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., Stoica, I.: LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=chfJJYC3iL
  13. [13] Jiang, X., Dong, Y., Tao, Y., Liu, H., Jin, Z., Li, G.: RoCode: Integrating backtracking mechanism and program analysis in large language models for code generation. In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). pp. 334–346 (2025). https://doi.org/10.1109/ICSE55347.2025.00133
  14. [14] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.R.: SWE-bench: Can language models resolve real-world GitHub issues? In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=VTF8yNQM66
  15. [15] Koch, G., Zemel, R., Salakhutdinov, R., et al.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop. vol. 2, pp. 1–30. Lille (2015)
  16. [16] Laurençon, H., Tronchon, L., Sanh, V.: Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. arXiv preprint arXiv:2403.09029 (2024)
  17. [17] Li, J., Li, G., Zhang, X., Zhao, Y., Dong, Y., Jin, Z., Li, B., Huang, F., Li, Y.: EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations. In: NeurIPS (2024). http://papers.nips.cc/paper_files/paper/2024/hash/6a059625a6027aca18302803743abaa2-Abstract-Datasets_and_Benchmarks_Track.html
  18. [18] Li, J., Li, G., Zhao, Y., Li, Y., Liu, H., Zhu, H., Wang, L., Liu, K., Fang, Z., Wang, L., Ding, J., Zhang, X., Zhu, Y., Dong, Y., Jin, Z., Li, B., Huang, F., Li, Y., Gu, B., Yang, M.: DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories. In: ACL (Findings). pp. 3603–3614 (2024). https://doi.org/10.18653/v1/202...
  19. [19] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al.: Competition-level code generation with AlphaCode. Science 378(6624), 1092–1097 (2022)
  20. [20] Lu, Z., Yang, Y., Ren, H., Hou, H., Xiao, H., Wang, K., Shi, W., Zhou, A., Zhan, M., Li, H.: WebGen-Bench: Evaluating LLMs on generating interactive and functional websites from scratch. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track
  21. [21] Nguyen, D., Chen, J., Wang, Y., Wu, G., Park, N., Hu, Z., Lyu, H., Wu, J., Aponte, R., Xia, Y., et al.: GUI agents: A survey. arXiv preprint arXiv:2412.13501 (2024)
  22. [22] OpenAI: Introducing GPT-5.2. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/ (2025)
  23. [23] Si, C., Zhang, Y., Li, R., Yang, Z., Liu, R., Yang, D.: Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In: NAACL (Long Papers). pp. 3956–3974 (2025). https://doi.org/10.18653/v1/2025.naacl-long.199
  24. [24] Sohrabizadeh, A., Song, J., Liu, M., Roy, R., Lee, C., Raiman, J., Catanzaro, B.: Nemotron-CORTEXA: Enhancing LLM agents for software engineering tasks via improved localization and solution diversity. In: Forty-second International Conference on Machine Learning (2025)
  25. [25] Song, L., Dai, Y., Prabhu, V., Zhang, J., Shi, T., Li, L., Li, J., Savarese, S., Chen, Z., Zhao, J., et al.: CoAct-1: Computer-using agents with coding as actions. arXiv preprint arXiv:2508.03923 (2025)
  26. [26] Sun, Y., Zhao, S., Yu, T., Wen, H., Va, S., Xu, M., Li, Y., Zhang, C.: GUI-Xplore: Empowering generalizable GUI agents with one exploration. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19477–19486 (2025)
  27. [27] Team, C.: Cursor documentation. https://cursor.com/cn/docs (2025)
  28. [28] Team, G.C.: Build, debug & deploy with AI. https://geminicli.com/ (2025)
  29. [29] Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al.: Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276 (2026)
  30. [30] Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., et al.: UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544 (2025)
  31. [31] Wang, R., Han, X., Ji, L., Wang, S., Baldwin, T., Li, H.: ToolGen: Unified tool retrieval and calling via generation. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=XLMAMmowdY
  32. [32] Wu, F., Gao, C., Li, S., Wen, X.C., Liao, Q.: MLLM-based UI2Code automation guided by UI layout information. Proceedings of the ACM on Software Engineering 2(ISSTA), 1123–1145 (2025)
  33. [33] Xiao, J., Wan, Y., Huo, Y., Wang, Z., Xu, X., Wang, W., Xu, Z., Wang, Y., Lyu, M.R.: Interaction2Code: Benchmarking MLLM-based interactive webpage code generation from interactive prototyping. arXiv preprint arXiv:2411.03292 (2024)
  34. [34] Xu, K., Mao, Y., Guan, X., Feng, Z.: Web-Bench: An LLM code benchmark based on web standards and frameworks. arXiv preprint arXiv:2505.07473 (2025)
  35. [35] Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., et al.: GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791 (2025)
  36. [36] Ye, H., Yang, A.Z., Hu, C., Wang, Y., Zhang, T., Le Goues, C.: AdverIntent-Agent: Adversarial reasoning for repair based on inferred program intent. Proc. ACM Softw. Eng. 2(ISSTA) (Jun 2025). https://doi.org/10.1145/3728939
  37. [37] Yuan, M., Chen, J., Xing, Z., Quigley, A., Luo, Y., Luo, T., Mohammadi, G., Lu, Q., Zhu, L.: DesignRepair: Dual-stream design guideline-aware frontend repair with large language models. In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). pp. 2483–2494. IEEE (Apr 2025). https://doi.org/10.1109/icse55347.2025.00109
  38. [38] Zhang, K., Li, J., Li, G., Shi, X., Jin, Z.: CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In: ACL (1). pp. 13643–13658 (2024). https://doi.org/10.18653/v1/2024.acl-long.737
  39. [39] Zhang, K., Li, Z., Li, J., Li, G., Jin, Z.: Self-Edit: Fault-aware code editor for code generation. arXiv preprint arXiv:2305.04087 (2023)
  40. [40] Zhang, Y., Zhao, X., Wang, Z.Z., Yang, C., Wei, J., Wu, T.: cAST: Enhancing code retrieval-augmented generation with structural chunking via abstract syntax tree. arXiv preprint arXiv:2506.15655 (2025)
  41. [41] Zhou, H., Zhang, X., Tong, P., Zhang, J., Chen, L., Kong, Q., Cai, C., Liu, C., Wang, Y., Zhou, J., et al.: MAI-UI technical report: Real-world centric foundation GUI agents. arXiv preprint arXiv:2512.22047 (2025)
  42. [42] Zhu, H., Zhang, Y., Zhao, B., Ding, J., Liu, S., Liu, T., Wang, D., Liu, Y., Li, Z.: FrontendBench: A benchmark for evaluating LLMs on front-end development via automatic evaluation. arXiv preprint arXiv:2506.13832 (2025)

Entries 43–62 below are excerpts from the paper's own prompt figures (Figs. 13–16) rather than external works:

  43. [43] 1:1 reproduce all visual content from the screenshots (layout, colors, fonts, spacing, text content, and all UI elements); for icons or images that cannot be implemented, rectangular boxes may be used as placeholders.
  44. [44] Implement all the interaction logic mentioned in the instructions.
  45. [45] Generate complete, directly runnable code without any syntax errors.
  46. [46] File structure: functionality may live in a single file or several files, but main.py must be the entry point (running python main.py starts the GUI program); multiple files must share one directory, and creating subdirectories is not allowed.
  47. [47] Component naming: strictly use the element names marked with single quotes in the instructions to name components (icon buttons, text buttons, input boxes, etc.) so they can be reached through accessibility tools; setAccessibleName() is recommended.
  48. [48] Output format: each file must use an independent code block tagged with its filename, e.g. a python block carrying filename="main.py" (from Fig. 13, the prompt used for base models such as GPT-5.2 and Claude-4.5-Sonnet).
  49. [49] Test interaction logic according to the instruction: follow the described navigation path (starting page, clicks on specific buttons or elements, target pages) and verify that the interaction logic executes correctly; report any step of the path that fails.
  50. [50] Check visual consistency with reference images: if a reference page screenshot is provided, compare the current page against it for layout, element positions, and styles, focusing only on structural visual information and not on content such as images.
  51. [51] If a scroll operation is ineffective and the required content still cannot be found, the interface may not support scrolling or the content is out of screen range; report immediately through action=terminate.
  52. [52] If a click operation is ineffective, the button has failed or its event is not bound; report the specific problem immediately through action=terminate.
  53. [53] Do not try the same operation a third time and do not look for other reasons to continue; terminate and report. Output format: a step-by-step Thought followed by an Action expressed with functions inside <tools></tools> XML tags.
  54. [54] Carefully analyze bug descriptions, the reference page screenshot, and the rendered interface screenshot at the time of the error to understand the root cause.
  55. [55] Locate the parts of the source code that need modification.
  56. [56] Use SEARCH/REPLACE blocks for precise incremental modifications.
  57. [57] Ensure fixes do not introduce new problems and keep the visual rendering as consistent as possible with the reference page screenshot; the SEARCH/REPLACE format avoids regenerating the entire file, saves tokens, and reduces error risk.
  58. [58] Exact match: the SEARCH part must match the file content character by character (including spaces, indentation, and comments).
  59. [59] Replace once: each SEARCH/REPLACE block replaces only the first occurrence.
  60. [60] Multiple modifications: if several changes are needed, use multiple independent blocks.
  61. [61] Keep it concise: include only the lines that need modification plus a small amount of context (2–5 lines), not large unchanged code sections.
  62. [62] Create new files: leave the SEARCH part empty and put the new file content in the REPLACE part (from Fig. 16, the prompt used for VF-Coder's Code Fixer).
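Entries 56–62 spell out the edit protocol the Code Fixer is prompted to use. Below is a minimal sketch of applying one such block (exact match, first occurrence only, empty SEARCH creates a new file); parsing of the fenced wrapper and filename line is omitted, and the example edit is hypothetical.

```python
# Apply a single SEARCH/REPLACE edit following the rules listed above.
from pathlib import Path

def apply_search_replace(path: Path, search: str, replace: str) -> None:
    if search == "":                     # empty SEARCH: create a new file with the REPLACE content
        path.write_text(replace)
        return
    original = path.read_text()
    if search not in original:           # exact, character-by-character match required
        raise ValueError(f"SEARCH text not found verbatim in {path}")
    path.write_text(original.replace(search, replace, 1))  # replace only the first occurrence

# Example with hypothetical content:
# apply_search_replace(Path("main.py"),
#                      "btn = QPushButton('Dowload')",
#                      "btn = QPushButton('Download')")
```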