pith. machine review for the scientific record.

arxiv: 2604.18591 · v1 · submitted 2026-03-18 · 💻 cs.HC · cs.AI

Recognition: no theorem link

SPRITE: From Static Mockups to Engine-Ready Game UI


Pith reviewed 2026-05-15 09:01 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords: game UI · screenshot-to-code · vision-language models · YAML · engine assets · UI development · automation

The pith

SPRITE converts static game UI screenshots into editable engine assets by combining vision-language models with structured YAML.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SPRITE as a pipeline that turns stylized game interface mockups into interactive engine-ready entities. Existing screenshot-to-code methods often fail when faced with non-rectangular shapes and deeply nested visual structures common in games. SPRITE addresses this by feeding vision-language models a YAML-based intermediate format that explicitly records container relationships and irregular layouts. Tests on a dedicated game UI benchmark plus reviews by professional developers indicate that the approach reduces manual coding and improves nesting accuracy. The result is faster movement from artistic mockup to playable in-engine prototype.
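The load-bearing artifact here is the YAML intermediate representation that records container relationships and irregular shapes explicitly. A minimal sketch of how such a representation might encode nesting and be flattened into engine-entity records follows; the field names (`type`, `shape`, `children`) are illustrative assumptions, not the paper's actual schema, and the YAML document is shown as an equivalent Python dict so the sketch stays stdlib-only.

```python
# Hypothetical intermediate representation for a game HUD. Unlike flat
# bounding-box output, each node records its shape class and its children,
# so the container hierarchy survives into the engine.
hud = {
    "id": "hud_root", "type": "container", "shape": "rect",
    "children": [
        {"id": "minimap", "type": "panel", "shape": "circle",   # non-rectangular
         "children": [
             {"id": "player_dot", "type": "icon", "shape": "rect", "children": []},
         ]},
        {"id": "health_bar", "type": "panel", "shape": "polygon",  # irregular layout
         "children": []},
    ],
}

def flatten(node, parent=None, depth=0):
    """Walk the hierarchy and emit one engine-entity record per node,
    preserving the explicit parent/child (container) relationships."""
    yield {"id": node["id"], "parent": parent, "depth": depth, "shape": node["shape"]}
    for child in node["children"]:
        yield from flatten(child, parent=node["id"], depth=depth + 1)

entities = list(flatten(hud))
```

The point of the exercise is that `player_dot` arrives in the engine as a child of `minimap` at depth 2, rather than as one more box in a flat list, which is exactly the nesting information the paper argues flat screenshot-to-code output loses.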

Core claim

SPRITE is a pipeline that transforms static screenshots into editable engine assets by integrating Vision-Language Models with a structured YAML intermediate representation, which explicitly captures complex container relationships and non-rectangular layouts, as shown by improved reconstruction fidelity on a curated Game UI benchmark and positive expert assessments of prototyping efficiency.

What carries the argument

The SPRITE pipeline, which uses Vision-Language Models guided by a structured YAML representation to capture container relationships and non-rectangular layouts in game interfaces.

If this is right

  • Automates tedious coding tasks for game UI implementation.
  • Resolves complex nesting and irregular geometry issues in UI layouts.
  • Facilitates rapid in-engine iteration and prototyping.
  • Blurs boundaries between artistic design and technical implementation in game development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar approaches could be adapted for complex UIs in non-game sectors like industrial controls or mobile apps.
  • Enhanced VLM capabilities could enable handling of dynamic or animated UI elements in future iterations.
  • Direct integration with popular game engines might allow seamless asset import and further reduce development time.

Load-bearing premise

Vision-language models guided by a structured YAML representation can reliably capture the irregular geometries and deep visual hierarchies typical of game interfaces.

What would settle it

A benchmark test on a highly complex game UI screenshot where the output engine assets fail to accurately replicate the nesting structure or non-rectangular shapes when imported and rendered.

Figures

Figures reproduced from arXiv: 2604.18591 by Chien Her Lim, Hao Zhang, Mengtian Li, Ming Yan, RuiHao Li, Yunshu Bai.

Figure 1: The SPRITE system. Transforming a raw Game UI screenshot (left) into editable engine assets (right). Unlike standard …

Figure 2: SPRITE. Our system transforms mockups into engine assets via three stages: (1) Semantic Scaffolding, VLM infers a …

Figure 4: System Prompt: UI Master Persona. For high-level semantic parsing and coarse component identification, we employ Qwen3-VL [2]. This initial parsing is driven by a carefully crafted system prompt (the "UI Master Persona"). Our prompt design follows three core rationales: (1) Functional Decoupling to force the VLM to filter aesthetic noise and isolate core UI …

Figure 5: Qualitative comparison. While VLMs (a-b) are limited to bounding boxes and the baseline (c) suffers from fragmentation …

Figure 6: Visual representation of the GameUI Benchmark gallery. These representative samples demonstrate the system's …
read the original abstract

Game UI implementation requires translating stylized mockups into interactive engine entities. However, current "Screenshot-to-Code" tools often struggle with the irregular geometries and deep visual hierarchies typical of game interfaces. To bridge this gap, we introduce SPRITE, a pipeline that transforms static screenshots into editable engine assets. By integrating Vision-Language Models (VLMs) with a structured YAML intermediate representation, SPRITE explicitly captures complex container relationships and non-rectangular layouts. We evaluated SPRITE against a curated Game UI benchmark and conducted expert reviews with professional developers to assess reconstruction fidelity and prototyping efficiency. Our findings demonstrate that SPRITE streamlines development by automating tedious coding and resolving complex nesting. By facilitating rapid in-engine iteration, SPRITE effectively blurs the boundaries between artistic design and technical implementation in game development. Project page: https://baiyunshu.github.io/sprite.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SPRITE, a pipeline that integrates vision-language models with a structured YAML intermediate representation to convert static game UI screenshots into editable engine assets. It claims to better handle irregular geometries and deep visual hierarchies than existing screenshot-to-code tools, with positive outcomes shown on a curated Game UI benchmark and expert reviews by professional developers assessing reconstruction fidelity and prototyping efficiency.

Significance. If the results hold, SPRITE could reduce manual coding effort in game UI implementation and enable faster design-to-engine iteration. The explicit YAML capture of container relationships and non-rectangular layouts is a constructive design choice that addresses a known pain point in game development tooling.

major comments (2)
  1. [Abstract] The central claim that SPRITE 'streamlines development by automating tedious coding and resolving complex nesting' rests on benchmark and expert-review results, yet the abstract (and manuscript) supplies no quantitative metrics such as layout-detection accuracy, geometry-reconstruction error rates, failure-case analysis, or baseline comparisons against prior screenshot-to-code systems.
  2. [Evaluation] No details are given on how reconstruction fidelity was measured (e.g., pixel-level overlap, hierarchy-edit distance, or engine-asset validity), nor are ablations or error breakdowns provided for VLM hallucinations on curved elements, overlapping panels, or deep nesting, the precise failure modes highlighted as the motivating challenge.
minor comments (1)
  1. [Abstract] The project page URL is given but no supplementary material (code, benchmark dataset, or prompt templates) is referenced in the text; adding such links would improve reproducibility.
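The hierarchy-edit distance the report asks for could be operationalized in several ways. One crude, illustrative version (not the paper's metric, and far simpler than a true tree edit distance) counts nodes that appear under a given parent in only one of two label-keyed UI trees, with the ground truth as the first argument and the prediction as the second:

```python
def count_nodes(tree):
    """Number of nodes in a tree encoded as {label: children_dict}."""
    return sum(1 + count_nodes(children) for children in tree.values())

def hierarchy_distance(gt, pred):
    """Crude hierarchy distance: one edit per node present under a given
    parent in only one tree, recursing on labels shared by both.
    A stand-in for a true tree edit distance, for illustration only."""
    cost = 0
    for label in set(gt) | set(pred):
        if label not in gt:
            cost += 1 + count_nodes(pred[label])   # hallucinated subtree
        elif label not in pred:
            cost += 1 + count_nodes(gt[label])     # missing subtree
        else:
            cost += hierarchy_distance(gt[label], pred[label])
    return cost

# Hypothetical ground truth vs. prediction: the prediction flattens the
# minimap, losing its nested player marker (one missing node).
gt = {"hud": {"minimap": {"marker": {}}, "health_bar": {}}}
pred = {"hud": {"minimap": {}, "health_bar": {}}}
```

Here `hierarchy_distance(gt, pred)` is 1 for the single lost node, which is the kind of nesting-sensitive score the referee's requested ablations would need.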

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of SPRITE to reduce manual coding effort in game UI development. We appreciate the positive note on the YAML intermediate representation. Below we provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that SPRITE 'streamlines development by automating tedious coding and resolving complex nesting' rests on benchmark and expert-review results, yet the abstract (and manuscript) supplies no quantitative metrics such as layout-detection accuracy, geometry-reconstruction error rates, failure-case analysis, or baseline comparisons against prior screenshot-to-code systems.

    Authors: We acknowledge this observation. While the full manuscript presents results from the curated Game UI benchmark and expert reviews, the abstract does not include specific quantitative figures. In the revised version, we will update the abstract to include key metrics such as layout-detection accuracy, geometry reconstruction performance, and comparisons to existing screenshot-to-code systems to better substantiate the central claims. revision: yes

  2. Referee: [Evaluation] No details are given on how reconstruction fidelity was measured (e.g., pixel-level overlap, hierarchy-edit distance, or engine-asset validity), nor are ablations or error breakdowns provided for VLM hallucinations on curved elements, overlapping panels, or deep nesting, the precise failure modes highlighted as the motivating challenge.

    Authors: We agree that more explicit details are needed. We will expand the Evaluation section to describe precisely how reconstruction fidelity was assessed, incorporating metrics like pixel-level overlap, hierarchy-edit distance, and checks for engine-asset validity. We will also add ablations and error analyses focusing on VLM hallucinations for curved elements, overlapping panels, and deep nesting to directly address the key challenges outlined in the paper. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline and evaluation are externally grounded

full rationale

The paper presents SPRITE as a new pipeline that combines VLMs with a YAML intermediate representation to convert game UI screenshots into engine assets. The central claims rest on a curated external benchmark plus independent expert developer reviews for fidelity and efficiency, with no equations, fitted parameters, or self-citations that reduce the reported outcomes to the inputs by construction. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces an applied system without mathematical derivations, free parameters, or formal axioms; the core reliance is on the assumed capabilities of existing vision-language models.

invented entities (1)
  • SPRITE pipeline (no independent evidence)
    purpose: Transform static game UI screenshots into editable engine assets via VLM parsing and YAML representation
    The pipeline is the central new artifact introduced by the paper.

pith-pipeline@v0.9.0 · 5455 in / 1121 out tokens · 62583 ms · 2026-05-15T09:01:11.128017+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 5 internal anchors

  1. [1]

    Anthropic. 2025. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5. Accessed: 2026-01-20

  2. [2]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xiong-Hui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Rongyao Fang, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayihen...

  3. [3]

    Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI symposium on engineering interactive computing systems. Association for Computing Machinery, New York, NY, USA, 1–6

  4. [4]

    Sacha Brisset, Romain Rouvoy, Lionel Seinturier, and Renaud Pawlak. 2021. Erratum: Leveraging Flexible Tree Matching to Repair Broken Locators in Web Automation Scripts. ArXiv abs/2106.04916 (2021), 1–34

  5. [5]

    Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, and Magy Seif El-Nasr. 2021. VINS: Visual Search for Mobile User Interface Design. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 423, 14 pages. doi:10.1145/3411764.3445762

  6. [6]

    Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, and Konstantinos N. Plataniotis. 2025. Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation. ArXiv abs/2508.20265 (2025), 1–42

  7. [7]

    Niraj Ramesh Dayama, Simo Santala, Lukas Brückner, Kashyap Todi, Jingzhou Du, and Antti Oulasvirta. 2021. Interactive Layout Transfer. In Proceedings of the 26th International Conference on Intelligent User Interfaces (College Station, TX, USA) (IUI ’21). Association for Computing Machinery, New York, NY, USA, 70–80. doi:10.1145/3397481.3450652

  8. [8]

    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (Québec City, QC, Canada) (UIST ’17). Association for Computing Machin...

  9. [9]

    Zhen Feng, Jiaqi Fang, Bo Cai, and Yingtao Zhang. 2021. GUIS2Code: A Computer Vision Tool to Generate Code Automatically from Graphical User Interface Sketches. In Proceedings of the 30th International Conference on Artificial Neural Networks (ICANN) (Bratislava, Slovakia). Springer-Verlag, Berlin, Heidelberg, 53–65. doi:10.1007/978-3-030-86365-4_5

  10. [10]

    Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S Davis, Vijay Mahadevan, and Abhinav Shrivastava. 2021. LayoutTransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE/CVF, Montreal, QC, Canada, 1004–1014

  11. [11]

    Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, and Hao Zhang. 2025. Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation. arXiv preprint arXiv:2512.10501 (2025), 1–12

  12. [12]

    Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. 2023. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, Vancouver, BC, Canada, 10167–10176

  13. [13]

    Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, and Xiangyu Yue. 2025. ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents. ArXiv abs/2507.22827 (2025), 1–20

  14. [14]

    Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics (NRL) 2, 1–2 (1955), 83–97

  15. [15]

    Ranjitha Kumar, Jerry O. Talton, Salman Ahmad, and Scott R. Klemmer. 2011. Bricolage: example-based retargeting for web design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Vancouver, BC, Canada) (CHI ’11). Association for Computing Machinery, New York, NY, USA, 2197–2206. doi:10.1145/1978942.1979262

  16. [16]

    Black Forest Labs, Stephen Batifol, A. Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Muller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. 2025. FLUX.1 Kontext: Flow Matching for In-Context Image...

  17. [17]

    Hugo Laurençon, Léo Tronchon, and Victor Sanh. 2024. Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset. ArXiv abs/2403.09029 (2024), 1–9

  18. [18]

    Triet Huynh Minh Le, Hao Chen, and Muhammad Ali Babar. 2020. Deep Learning for Source Code Modeling and Generation. ACM Computing Surveys (CSUR) 53 (2020), 1–38

  19. [19]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, et al. 2023. StarCoder: may the source be with you! Trans. Mach. Learn. Res. 2023 (2023), 1–55

  20. [20]

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. 2023. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023), 1–33

  21. [21]

    Yuwen Lu, Alan Leung, Amanda Swearngin, Jeffrey Nichols, and Titus Barik. 2025. Misty: UI Prototyping Through Interactive Conceptual Blending. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 1108, 17 pages. doi:10.1145/3706598.3713924

  22. [22]

    Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse Engineering Mobile Application User Interfaces with REMAUI (T). In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, Lincoln, NE, USA, 248–259

  23. [23]

    Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar Averbuch-Elor. 2020. READ: Recursive Autoencoders for Document Layout Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, Seattle, WA, USA, 2316–2325

  24. [24]

    Akshay Gadi Patil, Manyi Li, Matthew Fisher, Manolis Savva, and Hao Zhang

  25. [25]

    LayoutGMN: Neural Graph Matching for Structural Layout Similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, Nashville, TN, USA, 11043–11052

  26. [26]

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya K. Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloé Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross B. Girshick, Piotr Dollár, and Christoph Feichtenhofer. 2024. SAM 2: Segment Anything in Images and Videos. ArXiv abs/2408.00...

  27. [27]

    Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang

  28. [28]

    Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, Albuquerque, New Mexico, USA, 3956–3974

  29. [29]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al

  30. [30]

    OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025), 1–61

  31. [31]

    Davit Soselia, Khalid Saifullah, and Tianyi Zhou. 2023. Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering. arXiv preprint arXiv:2305.14637 (2023), 1–10

  32. [32]

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. 2022. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. IEEE/CVF, Waikoloa, HI, US...

  33. [33]

    Zhongliang Tang, Mengchen Tan, Fei Xia, Qingrong Cheng, Hao Jiang, and Yongxiang Zhang. 2024. AutoGameUI: Constructing High-Fidelity Game UIs via Multimodal Learning and Interactive Web-Based Tool. arXiv preprint arXiv:2411.03709 (2024), 1–9

  34. [34]

    Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael R. Lyu. 2024. Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach. ArXiv abs/2406.16386 (2024), 241–253

  35. [35]

    Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Gérard Dray, and Walid Maalej. 2025. On AI-Inspired User Interface Design. IEEE Software 42, 3 (2025), 50–58. doi:10.1109/MS.2025.3536838

  36. [36]

    Fan Wu, Cuiyun Gao, Shuqing Li, Xinjie Wen, and Qing Liao. 2025. MLLM-Based UI2Code Automation Guided by UI Layout Information. Proceedings of the ACM on Software Engineering 2 (2025), 1123–1145

  37. [37]

    Pengfei Xu, Yifan Li, Zhijin Yang, Weiran Shi, Hongbo Fu, and Hui Huang. 2022. Hierarchical Layout Blending with Recursive Optimal Correspondence. ACM Transactions on Graphics (TOG) 41 (2022), 1–15

  38. [38]

    Yong Xu, Lili Bo, Xiaobing Sun, Bin Li, Jing Jiang, and Wei Zhou. 2021. image2emmet: Automatic code generation from web user interface image. Journal of Software: Evolution and Process 33 (2021), 241–253

  39. [39]

    Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, and Hai Rao. 2025. UI-UG: A Unified MLLM for UI Understanding and Generation. ArXiv abs/2509.24361 (2025), 1–16

  40. [40]

    Houston H Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, et al. 2025. Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs. arXiv preprint arXiv:2512.19918 (2025), 1–25

  41. [41]

    Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sydney, NSW, Australia, 1015–1022

  42. [42]

    Ti Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang. 2025. DeclarUI: Bridging Design and Development with Automated Declarative UI Code Generation. Proceedings of the ACM on Software Engineering 2 (2025), 219–241