pith. machine review for the scientific record

arxiv: 2604.19750 · v1 · submitted 2026-03-14 · 💻 cs.SE · cs.AI · cs.HC

Recognition: no theorem link

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:02 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.HC
keywords GUI code generation · visual feedback · LLM agents · code debugging · multi-agent system · InteractGUI Bench · desktop applications · vision-based debugging
0 comments

The pith

Visual feedback and simulated interactions enable LLM agents to debug GUI code more reliably than text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM agents for code generation often fail on graphical user interfaces because they rely on text-based feedback, which can neither assess visual layout nor trigger event-driven behaviors through simulated clicks and inputs. The paper introduces InteractGUI Bench, a collection of 984 real-world desktop GUI tasks that tests both logic and appearance, along with VF-Coder, a multi-agent system that supplies visual screen perception and direct interaction capabilities. By operating in a human-like loop of seeing the rendered interface and acting on it, the system identifies and corrects issues that text feedback misses. On the benchmark this raises success rates and visual quality scores for models like Gemini-3-Flash. A reader would care because most everyday software involves GUIs, so closing the visual gap could make automated code generation practical for a large class of programs.
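As a rough sketch of that loop, using the three agent roles named in Figures 3 and 12 (Task Planner, GUI Operator, Code Fixer): the class and method names below are illustrative stand-ins, not the paper's implementation, and the agent bodies are placeholders rather than MLLM calls.

```python
# Minimal sketch of a vision-feedback multi-agent debugging loop.
# Role names follow Figures 3 and 12; all behaviors here are placeholders.
from dataclasses import dataclass, field

@dataclass
class Observation:
    """What the GUI Operator reports after looking at and driving the app."""
    screenshot: bytes = b""
    issues: list = field(default_factory=list)

class GUIOperator:
    """Renders the program, inspects the screen, simulates clicks, and reports
    logic or layout issues. Placeholder body: no rendering or model call here."""
    def inspect(self, code: str, instruction: str) -> Observation:
        return Observation(issues=[])

class CodeFixer:
    """Repairs the source given the operator's findings, e.g. via incremental
    SEARCH/REPLACE edits of the kind the Code Fixer prompt describes. Placeholder."""
    def fix(self, code: str, issues: list) -> str:
        return code

class TaskPlanner:
    """Coordinates the other two agents until no issues remain or the budget is spent."""
    def __init__(self, operator: GUIOperator, fixer: CodeFixer, max_rounds: int = 3):
        self.operator, self.fixer, self.max_rounds = operator, fixer, max_rounds

    def run(self, code: str, instruction: str) -> str:
        for _ in range(self.max_rounds):
            report = self.operator.inspect(code, instruction)  # "see" the rendered GUI
            if not report.issues:                              # nothing visibly wrong: stop
                return code
            code = self.fixer.fix(code, report.issues)         # "act": repair and re-check
        return code

if __name__ == "__main__":
    final = TaskPlanner(GUIOperator(), CodeFixer()).run("<generated GUI code>", "Build a settings page")
    print(final)
```

The structural point is that the stopping condition is driven by what the operator can see and trigger on screen, not by compiler or console output.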

Core claim

VF-Coder is a vision-feedback-based multi-agent system that perceives visual information from the rendered GUI interface and directly interacts with program elements to identify potential logic and layout issues, thereby increasing the success rate of Gemini-3-Flash from 21.68% to 28.29% and the visual score from 0.4284 to 0.5584 on InteractGUI Bench.

What carries the argument

VF-Coder, a vision-feedback-based multi-agent system that supplies visual perception of the rendered interface and simulated user interactions to generate corrective feedback for GUI code.

If this is right

  • GUI code debugging improves when agents can see the rendered screen and simulate interactions rather than relying on text outputs alone.
  • Text-based approaches remain limited for event-driven programs and visual attributes that require direct inspection of the interface.
  • InteractGUI Bench enables fine-grained measurement of both interaction logic and visual structure in GUI tasks.
  • Multi-agent setups can use vision to locate and fix functional errors as well as appearance mismatches in one workflow.
  • Base model performance on GUI generation rises when visual feedback is added, as shown by the lift from 21.68% to 28.29% success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visual loop could be tested on web or mobile GUI frameworks to check whether desktop results generalize.
  • Stronger vision-language models would likely widen the observed gap between text-only and visual-feedback methods.
  • The benchmark could serve as a standard testbed for comparing alternative feedback mechanisms such as accessibility trees or pixel-level analysis.
  • Integrating visual feedback early in code generation rather than only at debug time might further reduce the number of iterations required.

Load-bearing premise

Visual perception of the rendered interface combined with simulated interactions is sufficient to detect and correct both logic errors and layout problems that text-based feedback cannot resolve.
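A minimal sketch of the simulated-interaction half of that premise, assuming a PyQt5 app (the benchmark prompts require a main.py entry point and setAccessibleName() naming, per the prompt excerpts later on this page): locate an element by its accessible name, click it programmatically, and check that the event logic responds. The two-page demo app is invented for illustration.

```python
# Simulated-interaction check for event-driven GUI logic (illustrative app).
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # allow headless execution
from PyQt5.QtCore import Qt
from PyQt5.QtTest import QTest
from PyQt5.QtWidgets import (QApplication, QLabel, QPushButton,
                             QStackedWidget, QVBoxLayout, QWidget)

app = QApplication([])
win = QWidget()
pages = QStackedWidget()
home, settings = QLabel("Home"), QLabel("Settings")
pages.addWidget(home)
pages.addWidget(settings)

go = QPushButton("Open settings")
go.setAccessibleName("Settings_Button")        # naming convention from the benchmark prompts
go.clicked.connect(lambda: pages.setCurrentWidget(settings))

layout = QVBoxLayout(win)
layout.addWidget(pages)
layout.addWidget(go)
win.show()

# Drive the app the way a user would: find the element by accessible name,
# click it, and verify that the navigation logic actually fired.
target = next(b for b in win.findChildren(QPushButton)
              if b.accessibleName() == "Settings_Button")
QTest.mouseClick(target, Qt.LeftButton)
assert pages.currentWidget() is settings, "navigation logic failed"
print("interaction check passed: the click reached the Settings page")
```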

What would settle it

A GUI task whose errors depend on non-visible internal state such as network responses or hidden variables, where VF-Coder produces no gain over text-only debugging.
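As a concrete (invented) instance of such a task: the window below renders correctly and its button responds to a simulated click, yet the defect sits in a hidden constant, a wrong endpoint path, that neither a screenshot nor a click trace exposes. The URL and widget names are hypothetical.

```python
# A bug that lives in hidden state: visually fine, clickable, still wrong.
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # headless execution
from PyQt5.QtCore import Qt
from PyQt5.QtTest import QTest
from PyQt5.QtWidgets import QApplication, QPushButton, QVBoxLayout, QWidget

API_URL = "https://example.invalid/v2/export"  # bug: the (hypothetical) server only serves /v1/

def export_report():
    # The request would fail at runtime, but the window renders perfectly and
    # the handler fires, so purely visual feedback has nothing to flag.
    print(f"POST {API_URL} (would fail: endpoint does not exist)")

app = QApplication([])
win = QWidget()
layout = QVBoxLayout(win)
button = QPushButton("Export report")
button.setAccessibleName("Export_Button")
button.clicked.connect(export_report)
layout.addWidget(button)
win.show()

QTest.mouseClick(button, Qt.LeftButton)  # the click "works"; the failure is invisible on screen
```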

Figures

Figures reproduced from arXiv: 2604.19750 by Lixin Duan, Ruizhi Zhang, Ting Xie, Wen Li, Ye Huang, Zhilin Liu.

Figure 1. Comparison between (a, b) traditional text-output-based debugging procedures and (c) our vision-perception-based debugging procedure (VF-Coder), which enables direct perception of rendered GUIs for interactive correction.
Figure 2. The data collection and automatic-testing pipeline. (a) Collecting real-world desktop UI screenshots from Flathub, constructing task instructions, and building the IES. (b) The workflow for testing generated GUI apps using a series of steps defined by the IES.
Figure 3. Overall architecture (left) and interaction details (right) of VF-Coder. The right shows a real case: the GUI Operator detects a rendering error on the download page through visual interaction, and under the coordination of the Task Planner, the Code Fixer identifies and repairs the issue.
Figure 4. Visual comparison of Cursor CLI, Gemini CLI, and VF-Coder on a real case.
Figure 5. Task category composition of InteractGUI Bench.
Figure 6. Visual comparison of rendered interfaces by Cursor CLI, Gemini CLI, and VF-Coder on representative InteractGUI Bench tasks. The Visual Score is computed by the Visual Evaluation Model; VF-Coder achieves the highest visual fidelity to the real GUI screenshots and the highest visual score.
Figure 7. MAE curves on the training and validation sets during training of the Visual Evaluation Model. The 18th epoch achieved the best validation MAE (0.0964) and is selected as the final model.
Figure 8. Qualitative visualization of Cases a-d ranked by Visual Evaluation Model scores (high to moderate). Scores correlate well with perceived visual similarity to real screenshots.
Figure 9. Qualitative visualization of Cases e-g ranked by Visual Evaluation Model scores (moderate to near-zero). Near-zero scores correctly indicate either significant rendering errors or fundamental differences in application layout style.
Figure 10. InteractGUI Bench task showcase (example instruction: develop a desktop application for Visual Studio Code following the provided design, with 'Page_VSCode_Main' as the primary entry interface and a standard menu bar including 'File', 'Edit', 'Selection', 'View', ...).
Figure 11. InteractGUI Bench task showcase.
Figure 12. InteractGUI Bench task showcase. Prompts for VF-Coder's three agents (Task Planner, GUI Operator, Code Fixer) are detailed in Figures 14, 15, and 16.
Figure 13. The prompt used for base models (e.g., GPT-5.2, Claude-4.5-Sonnet) to generate GUI apps on InteractGUI Bench.
Figure 14. The prompt used for VF-Coder's Task Planner.
Figure 15. The prompt used for VF-Coder's GUI Operator.
Figure 16. The prompt used for VF-Coder's Code Fixer.
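The Visual Score in Figures 6-9 is produced by a learned Visual Evaluation Model trained to a validation MAE of 0.0964 (Figure 7). As a rough intuition for what a screenshot-to-reference similarity score measures, here is a naive pixel-level proxy; it is not the paper's model, and the file names are illustrative.

```python
# Naive stand-in for a visual score: normalized grayscale pixel difference
# between the rendered screenshot and the reference UI screenshot.
from PIL import Image, ImageChops

def crude_visual_score(rendered_path: str, reference_path: str, size=(512, 512)) -> float:
    rendered = Image.open(rendered_path).convert("L").resize(size)
    reference = Image.open(reference_path).convert("L").resize(size)
    diff = ImageChops.difference(rendered, reference)          # per-pixel absolute difference
    mean_abs_error = sum(diff.getdata()) / (size[0] * size[1] * 255.0)
    return 1.0 - mean_abs_error                                # 1.0 = identical, 0.0 = maximally different

# Example (hypothetical file names):
# score = crude_visual_score("rendered.png", "Page_Dashboard_Home.png")
```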
read the original abstract

Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g., command-line outputs) for multi-round debugging and struggle with graphical user interfaces (GUIs) that involve visual information. This is mainly due to two limitations: 1) GUI programs are event-driven, yet existing methods cannot simulate user interactions to trigger GUI element logic; 2) GUI programs possess visual attributes, making it difficult for text-based approaches to assess whether the rendered interface meets user needs. To systematically address these challenges, we first introduce InteractGUI Bench, a novel benchmark comprising 984 commonly used real-world desktop GUI application tasks designed for fine-grained evaluation of both interaction logic and visual structure. Furthermore, we propose VF-Coder, a vision-feedback-based multi-agent system for debugging GUI code. By perceiving visual information and directly interacting with program interfaces, VF-Coder can identify potential logic and layout issues in a human-like manner. On InteractGUI Bench, our VF-Coder approach increases the success rate of Gemini-3-Flash from 21.68% to 28.29% and raises the visual score from 0.4284 to 0.5584, indicating the effectiveness of visual feedback in GUI debugging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces InteractGUI Bench, a benchmark of 984 real-world desktop GUI tasks for fine-grained evaluation of interaction logic and visual structure, and proposes VF-Coder, a vision-feedback-based multi-agent system that perceives screenshots and simulates interactions to debug GUI code. It claims that VF-Coder raises Gemini-3-Flash success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on the benchmark.

Significance. If the gains can be attributed specifically to visual feedback rather than the multi-agent scaffolding, the work would offer a practical method for overcoming text-only limitations in GUI code generation and debugging, particularly for layout and event-driven issues.

major comments (1)
  1. [Experiments] Experiments section: the headline improvements are attributed to visual perception plus interaction, yet no ablation is reported that holds the multi-agent loop, planning, and interaction simulation fixed while replacing screenshots with textual UI descriptions. This control is required to isolate the contribution of vision from the overall architecture.
minor comments (1)
  1. [Abstract] Abstract: concrete numerical results are stated without any description of the experimental protocol, baseline implementations, number of runs, statistical tests, or controls for confounds such as prompt variations.
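For scale on the minor point: with 984 tasks and a single run per task, a simple bootstrap interval over per-task outcomes would indicate how much of the reported 6.6-point gain could be sampling noise. The per-task outcomes below are synthetic, constructed only to match the paper's headline rate; nothing here reproduces the actual evaluation.

```python
# Percentile bootstrap over synthetic 0/1 task outcomes at the reported rate.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

n_tasks = 984
n_success = round(0.2168 * n_tasks)               # 21.68% headline rate from the abstract
baseline = [1] * n_success + [0] * (n_tasks - n_success)
print(bootstrap_ci(baseline))                     # roughly a +/- 2.6-point interval around 21.7%
```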

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We agree that isolating the contribution of visual feedback is important for strengthening the claims and will revise the manuscript to address this.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline improvements are attributed to visual perception plus interaction, yet no ablation is reported that holds the multi-agent loop, planning, and interaction simulation fixed while replacing screenshots with textual UI descriptions. This control is required to isolate the contribution of vision from the overall architecture.

    Authors: We agree that this ablation is necessary to isolate the specific contribution of direct visual perception. In the revised manuscript, we will add a new control experiment that keeps the multi-agent loop, planning, and interaction simulation fixed while replacing screenshot inputs with textual UI descriptions (e.g., generated via accessibility trees or LLM-based captioning of the same interfaces). We will report success rates and visual scores for this text-only variant on InteractGUI Bench and compare them directly to the full VF-Coder results. This will clarify the incremental benefit of vision over the scaffolding alone. revision: yes
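A sketch of the kind of text-only control this response describes, assuming a PyQt5 app: serialize the widget tree (class, accessible name, text, geometry) into a textual UI description that could replace the screenshot while the rest of the loop stays fixed. The demo window and names are illustrative, not part of the paper.

```python
# Textual UI description in the spirit of an accessibility-tree dump.
import os
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")  # headless execution
from PyQt5.QtWidgets import QApplication, QLabel, QPushButton, QVBoxLayout, QWidget

def describe(widget, depth=0):
    """Recursively serialize class, accessible name, text, and geometry."""
    text = widget.text() if hasattr(widget, "text") else ""
    lines = ["  " * depth + f"{type(widget).__name__} "
             f"name={widget.accessibleName()!r} text={text!r} "
             f"geom={widget.geometry().getRect()}"]
    for child in widget.children():
        if isinstance(child, QWidget):
            lines += describe(child, depth + 1)
    return lines

app = QApplication([])
win = QWidget()
layout = QVBoxLayout(win)
label = QLabel("Settings")
button = QPushButton("Download")
button.setAccessibleName("Download_Button")
layout.addWidget(label)
layout.addWidget(button)
win.show()

print("\n".join(describe(win)))   # this text would stand in for the screenshot
```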

Circularity Check

0 steps flagged

Empirical benchmark results contain no definitional or self-referential circularity

full rationale

The paper introduces InteractGUI Bench as a new collection of 984 tasks and VF-Coder as a multi-agent architecture that uses screenshots plus simulated interactions. Reported gains (Gemini-3-Flash success 21.68% → 28.29%, visual score 0.4284 → 0.5584) are direct head-to-head measurements on the same fixed task set. No equations, fitted parameters, or self-citations are invoked to derive these numbers; the quantities are externally observable success and visual-match metrics. The derivation chain is therefore self-contained and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5546 in / 966 out tokens · 49923 ms · 2026-05-15T12:02:45.033197+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 5 internal anchors

  1. [1] Acharya, M., Zhang, Y., Leach, K., Huang, Y.: Optimizing code runtime performance through context-aware retrieval-augmented generation. CoRR abs/2501.16692 (2025). https://doi.org/10.48550/arXiv.2501.16692
  2. [2] Anthropic: Claude Sonnet 4.5. https://www.anthropic.com/claude/sonnet (2025)
  3. [3] Antoniades, A., Örwall, A., Zhang, K., Xie, Y., Goyal, A., Wang, W.Y.: SWE-Search: Enhancing software agents with Monte Carlo tree search and iterative refinement. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=G7sIFXugTX
  4. [4] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al.: Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)
  5. [5] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
  6. [6] Google DeepMind: Gemini 3 Flash: Best for frontier intelligence at speed. https://deepmind.google/models/gemini/flash/ (2025)
  7. [7] Google DeepMind: Gemini 3.1 Pro: Best for complex tasks and bringing creative concepts to life. https://deepmind.google/models/gemini/pro/ (2026)
  8. [8] Gupta, T., Weihs, L., Kembhavi, A.: CodeNav: Beyond tool-use to using real-world codebases with LLM agents. arXiv preprint arXiv:2406.12276 (2024)
  9. [9] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
  10. [10] Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al.: Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938 (2021)
  11. [11] Hu, Y., Zhou, Q., Chen, Q., Li, X., Liu, L., Zhang, D., Kachroo, A., Oz, T., Tripp, O.: QualityFlow: An agentic workflow for program synthesis controlled by LLM quality checks. arXiv preprint arXiv:2501.17167 (2025)
  12. [12] Jain, N., Han, K., Gu, A., Li, W.D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., Stoica, I.: LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=chfJJYC3iL
  13. [13] Jiang, X., Dong, Y., Tao, Y., Liu, H., Jin, Z., Li, G.: RoCode: Integrating backtracking mechanism and program analysis in large language models for code generation. In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). pp. 334–346 (2025). https://doi.org/10.1109/ICSE55347.2025.00133
  14. [14] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.R.: SWE-bench: Can language models resolve real-world GitHub issues? In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=VTF8yNQM66
  15. [15] Koch, G., Zemel, R., Salakhutdinov, R., et al.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop. vol. 2, pp. 1–30. Lille (2015)
  16. [16] Laurençon, H., Tronchon, L., Sanh, V.: Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. arXiv preprint arXiv:2403.09029 (2024)
  17. [17] Li, J., Li, G., Zhang, X., Zhao, Y., Dong, Y., Jin, Z., Li, B., Huang, F., Li, Y.: EvoCodeBench: An evolving code generation benchmark with domain-specific evaluations. In: NeurIPS (2024). http://papers.nips.cc/paper_files/paper/2024/hash/6a059625a6027aca18302803743abaa2-Abstract-Datasets_and_Benchmarks_Track.html
  18. [18] Li, J., Li, G., Zhao, Y., Li, Y., Liu, H., Zhu, H., Wang, L., Liu, K., Fang, Z., Wang, L., Ding, J., Zhang, X., Zhu, Y., Dong, Y., Jin, Z., Li, B., Huang, F., Li, Y., Gu, B., Yang, M.: DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories. In: ACL (Findings). pp. 3603–3614 (2024). https://doi.org/10.18653/v1/202...
  19. [19] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al.: Competition-level code generation with AlphaCode. Science 378(6624), 1092–1097 (2022)
  20. [20] Lu, Z., Yang, Y., Ren, H., Hou, H., Xiao, H., Wang, K., Shi, W., Zhou, A., Zhan, M., Li, H.: WebGen-Bench: Evaluating LLMs on generating interactive and functional websites from scratch. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track
  21. [21] Nguyen, D., Chen, J., Wang, Y., Wu, G., Park, N., Hu, Z., Lyu, H., Wu, J., Aponte, R., Xia, Y., et al.: GUI agents: A survey. arXiv preprint arXiv:2412.13501 (2024)
  22. [22] OpenAI: Introducing GPT-5.2. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/ (2025)
  23. [23] Si, C., Zhang, Y., Li, R., Yang, Z., Liu, R., Yang, D.: Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In: NAACL (Long Papers). pp. 3956–3974 (2025). https://doi.org/10.18653/v1/2025.naacl-long.199
  24. [24] Sohrabizadeh, A., Song, J., Liu, M., Roy, R., Lee, C., Raiman, J., Catanzaro, B.: Nemotron-CORTEXA: Enhancing LLM agents for software engineering tasks via improved localization and solution diversity. In: Forty-second International Conference on Machine Learning (2025)
  25. [25] Song, L., Dai, Y., Prabhu, V., Zhang, J., Shi, T., Li, L., Li, J., Savarese, S., Chen, Z., Zhao, J., et al.: CoAct-1: Computer-using agents with coding as actions. arXiv preprint arXiv:2508.03923 (2025)
  26. [26] Sun, Y., Zhao, S., Yu, T., Wen, H., Va, S., Xu, M., Li, Y., Zhang, C.: GUI-Xplore: Empowering generalizable GUI agents with one exploration. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19477–19486 (2025)
  27. [27] Team, C.: Cursor documentation. https://cursor.com/cn/docs (2025)
  28. [28] Team, G.C.: Build, debug & deploy with AI. https://geminicli.com/ (2025)
  29. [29] Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al.: Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276 (2026)
  30. [30] Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., et al.: UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544 (2025)
  31. [31] Wang, R., Han, X., Ji, L., Wang, S., Baldwin, T., Li, H.: ToolGen: Unified tool retrieval and calling via generation. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=XLMAMmowdY
  32. [32] Wu, F., Gao, C., Li, S., Wen, X.C., Liao, Q.: MLLM-based UI2Code automation guided by UI layout information. Proceedings of the ACM on Software Engineering 2(ISSTA), 1123–1145 (2025)
  33. [33] Xiao, J., Wan, Y., Huo, Y., Wang, Z., Xu, X., Wang, W., Xu, Z., Wang, Y., Lyu, M.R.: Interaction2Code: Benchmarking MLLM-based interactive webpage code generation from interactive prototyping. arXiv preprint arXiv:2411.03292 (2024)
  34. [34] Xu, K., Mao, Y., Guan, X., Feng, Z.: Web-Bench: An LLM code benchmark based on web standards and frameworks. arXiv preprint arXiv:2505.07473 (2025)
  35. [35] Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., et al.: GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791 (2025)
  36. [36] Ye, H., Yang, A.Z., Hu, C., Wang, Y., Zhang, T., Le Goues, C.: AdverIntent-Agent: Adversarial reasoning for repair based on inferred program intent. Proc. ACM Softw. Eng. 2(ISSTA) (Jun 2025). https://doi.org/10.1145/3728939
  37. [37] Yuan, M., Chen, J., Xing, Z., Quigley, A., Luo, Y., Luo, T., Mohammadi, G., Lu, Q., Zhu, L.: DesignRepair: Dual-stream design guideline-aware frontend repair with large language models. In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). pp. 2483–2494. IEEE (Apr 2025). https://doi.org/10.1109/icse55347.2025.00109
  38. [38] Zhang, K., Li, J., Li, G., Shi, X., Jin, Z.: CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In: ACL (1). pp. 13643–13658 (2024). https://doi.org/10.18653/v1/2024.acl-long.737
  39. [39] Zhang, K., Li, Z., Li, J., Li, G., Jin, Z.: Self-Edit: Fault-aware code editor for code generation. arXiv preprint arXiv:2305.04087 (2023)
  40. [40] Zhang, Y., Zhao, X., Wang, Z.Z., Yang, C., Wei, J., Wu, T.: cAST: Enhancing code retrieval-augmented generation with structural chunking via abstract syntax tree. arXiv preprint arXiv:2506.15655 (2025)
  41. [41] Zhou, H., Zhang, X., Tong, P., Zhang, J., Chen, L., Kong, Q., Cai, C., Liu, C., Wang, Y., Zhou, J., et al.: MAI-UI technical report: Real-world centric foundation GUI agents. arXiv preprint arXiv:2512.22047 (2025)
  42. [42] Zhu, H., Zhang, Y., Zhao, B., Ding, J., Liu, S., Liu, T., Wang, D., Liu, Y., Li, Z.: FrontendBench: A benchmark for evaluating LLMs on front-end development via automatic evaluation. arXiv preprint arXiv:2506.13832 (2025)

Entries 43–62 below are excerpts from the paper's own prompt figures (Figs. 13–16) rather than external works:

  43. [43] 1:1 reproduce all visual content from the screenshots (layout, colors, fonts, spacing, text content, and all UI elements); for icons or images that cannot be implemented, rectangular boxes may be used as placeholders.
  44. [44] Implement all the interaction logic mentioned in the instructions.
  45. [45] Generate complete, directly runnable code without any syntax errors.
  46. [46] File structure: functionality may live in a single file or several files, but main.py must be the entry point (running python main.py starts the GUI program); multiple files must share one directory, and creating subdirectories is not allowed.
  47. [47] Component naming: strictly use the element names marked with single quotes in the instructions to name components (icon buttons, text buttons, input boxes, etc.) so they can be reached through accessibility tools; setAccessibleName() is recommended.
  48. [48] Output format: each file must use an independent code block tagged with its filename, e.g. a python block carrying filename="main.py" (from Fig. 13, the prompt used for base models such as GPT-5.2 and Claude-4.5-Sonnet).
  49. [49] Test interaction logic according to the instruction: follow the described navigation path (starting page, clicks on specific buttons or elements, target pages) and verify that the interaction logic executes correctly; report any step of the path that fails.
  50. [50] Check visual consistency with reference images: if a reference page screenshot is provided, compare the current page against it for layout, element positions, and styles, focusing only on structural visual information and not on content such as images.
  51. [51] If a scroll operation is ineffective and the required content still cannot be found, the interface may not support scrolling or the content is out of screen range; report immediately through action=terminate.
  52. [52] If a click operation is ineffective, the button has failed or its event is not bound; report the specific problem immediately through action=terminate.
  53. [53] Do not try the same operation a third time and do not look for other reasons to continue; terminate and report. Output format: a step-by-step Thought followed by an Action expressed with functions inside <tools></tools> XML tags.
  54. [54] Carefully analyze bug descriptions, the reference page screenshot, and the rendered interface screenshot at the time of the error to understand the root cause.
  55. [55] Locate the parts of the source code that need modification.
  56. [56] Use SEARCH/REPLACE blocks for precise incremental modifications.
  57. [57] Ensure fixes do not introduce new problems and keep the visual rendering as consistent as possible with the reference page screenshot; the SEARCH/REPLACE format avoids regenerating the entire file, saves tokens, and reduces error risk.
  58. [58] Exact match: the SEARCH part must match the file content character by character (including spaces, indentation, and comments).
  59. [59] Replace once: each SEARCH/REPLACE block replaces only the first occurrence.
  60. [60] Multiple modifications: if several changes are needed, use multiple independent blocks.
  61. [61] Keep it concise: include only the lines that need modification plus a small amount of context (2–5 lines), not large unchanged code sections.
  62. [62] Create new files: leave the SEARCH part empty and put the new file content in the REPLACE part (from Fig. 16, the prompt used for VF-Coder's Code Fixer).
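Entries 56–62 spell out the edit protocol the Code Fixer is prompted to use. Below is a minimal sketch of applying one such block (exact match, first occurrence only, empty SEARCH creates a new file); parsing of the fenced wrapper and filename line is omitted, and the example edit is hypothetical.

```python
# Apply a single SEARCH/REPLACE edit following the rules listed above.
from pathlib import Path

def apply_search_replace(path: Path, search: str, replace: str) -> None:
    if search == "":                     # empty SEARCH: create a new file with the REPLACE content
        path.write_text(replace)
        return
    original = path.read_text()
    if search not in original:           # exact, character-by-character match required
        raise ValueError(f"SEARCH text not found verbatim in {path}")
    path.write_text(original.replace(search, replace, 1))  # replace only the first occurrence

# Example with hypothetical content:
# apply_search_replace(Path("main.py"),
#                      "btn = QPushButton('Dowload')",
#                      "btn = QPushButton('Download')")
```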