Recognition: unknown
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3
The pith
UIPress inserts a learned compression module between the ViT encoder and the LLM decoder to reduce UI screenshot tokens from ~6,700 to a fixed budget of 256 while raising generation quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UIPress is the first encoder-side learned compression method for UI-to-Code: a lightweight module placed after the frozen ViT encoder of Qwen3-VL-8B compresses approximately 6,700 visual tokens to a fixed budget of 256 using depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement, while the LLM decoder is adapted with LoRA. Under identical base-model conditions it reaches a CLIP score of 0.8127 on Design2Code, surpassing the uncompressed baseline by 7.5 percent and the best inference-time baseline by 4.6 percent while delivering a 9.1× speedup in time-to-first-token.
What carries the argument
The UIPress compression module, which learns to map full ViT visual token sequences into a fixed 256-token representation via depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement before the LLM decoder, supplemented by LoRA adaptation on the decoder.
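The paper's implementation is not reproduced here, but a minimal PyTorch sketch of the kind of module this description implies may help orient the reader: a depthwise-separable convolution over the 2-D token grid, a learned per-token reweighting gate standing in for element guidance, cross-attention pooling to a fixed 256-token budget, and a small Transformer refinement stage. The class name `UIPressCompressorSketch`, all layer sizes, and the pooling scheme are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only: the real UIPress architecture is not reproduced here.
# Layer sizes, the pooling strategy, and the gate design are assumptions.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution over the 2-D token grid followed by a 1x1 pointwise mix."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, dim, H, W)
        return self.pointwise(self.depthwise(x))


class UIPressCompressorSketch(nn.Module):
    """Maps (B, N, dim) ViT tokens laid out on an HxW grid to a fixed 256-token budget."""

    def __init__(self, dim: int = 1024, budget: int = 256, n_refine_layers: int = 2):
        super().__init__()
        self.conv = DepthwiseSeparableConv(dim)
        # Stand-in for element-guided reweighting: a per-token scalar gate predicted
        # from the features themselves (the paper's element guidance is richer).
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1))
        self.queries = nn.Parameter(torch.randn(budget, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        refine_layer = nn.TransformerEncoderLayer(dim, nhead=8,
                                                  dim_feedforward=4 * dim, batch_first=True)
        self.refine = nn.TransformerEncoder(refine_layer, num_layers=n_refine_layers)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple[int, int]) -> torch.Tensor:
        B, N, D = tokens.shape
        H, W = grid_hw                                   # requires N == H * W
        x = tokens.transpose(1, 2).reshape(B, D, H, W)   # tokens back onto the grid
        x = self.conv(x).flatten(2).transpose(1, 2)      # (B, N, D) after local mixing
        x = x * torch.sigmoid(self.gate(x))              # learned spatial reweighting
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, 256, D) learned queries
        pooled, _ = self.pool(q, x, x)                   # cross-attention pooling to 256
        return self.refine(pooled)                       # (B, 256, D) refined visual prefix


if __name__ == "__main__":
    dummy = torch.randn(1, 84 * 80, 1024)                     # ~6,700 tokens on an assumed 84x80 grid
    print(UIPressCompressorSketch()(dummy, (84, 80)).shape)   # torch.Size([1, 256, 1024])
```

The point of a fixed learned query set is that the decoder always receives a 256-token visual prefix regardless of screenshot resolution, which is where the prefill saving in the claim comes from.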
If this is right
- Reduces prefill latency enough to support real-time UI prototyping workflows.
- Demonstrates that task-specific optical compression can outperform both uncompressed and heuristic selection baselines on the same base model.
- Adds only 21.7 million trainable parameters (0.26 percent of the 8B model) while preserving or improving output quality.
- Establishes encoder-side learned compression as viable for UI-to-Code where prior methods either kept full token length or used task-agnostic heuristics.
- Opens the possibility of applying the same compression pattern to other long-output vision-to-code or vision-to-text tasks.
Where Pith is reading between the lines
- The same compression pattern could be tested on chart-to-code or document-to-structured-data tasks that also suffer from high visual token counts.
- Fixed-budget compression may need to be replaced by adaptive budgets that scale with UI complexity to avoid under-compression on dense screens.
- Because the encoder stays frozen, the method could be ported to other VLMs without retraining the entire vision backbone.
- Lower token counts may enable larger batch sizes during serving, improving throughput for UI generation services beyond the reported single-example speedup; a rough back-of-envelope on the prefill saving follows below.
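As a rough orientation on why shrinking the visual prefix cuts time-to-first-token: prefill compute per decoder layer grows roughly linearly with sequence length for the projection and MLP terms and quadratically for attention, so dropping from ~6,700 visual tokens to 256 removes most of both. The sketch below only evaluates that crude approximation; the model width, the assumed 300-token text prompt, and the FLOP coefficients are placeholders, and it ignores kernel efficiency and memory effects, so it will not reproduce the paper's measured 9.1× figure.

```python
# Crude prefill-FLOP approximation per transformer layer; placeholders throughout.
#   linear terms (QKV/output projections + MLP) ~ n * d^2
#   attention score/value terms                 ~ n^2 * d
def prefill_flops(n_tokens: int, d_model: int = 4096, mlp_mult: int = 4) -> float:
    linear = n_tokens * d_model ** 2 * (4 + 2 * mlp_mult)   # projections plus MLP up/down
    attention = 2 * n_tokens ** 2 * d_model                 # scores plus weighted sum
    return linear + attention

prompt_tokens = 300                                          # assumed text-prompt length
full = prefill_flops(6700 + prompt_tokens)                   # uncompressed visual prefix
compressed = prefill_flops(256 + prompt_tokens)              # UIPress budget
print(f"approximate prefill FLOP ratio: {full / compressed:.1f}x")
```

That the measured 9.1× sits below a FLOP-only estimate is unsurprising, since time-to-first-token also includes vision encoding and other fixed costs.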
Load-bearing premise
That a fixed 256-token compression still supplies enough visual detail for the LLM to produce accurate structured HTML and CSS across varied UI designs, and that LoRA can reliably close any gap created by the compression step.
What would settle it
Measuring whether CLIP score and code-visual alignment drop sharply when the same model is tested on a held-out set of highly detailed or interactive UIs whose layout complexity exceeds the Design2Code training distribution.
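Design2Code-style CLIP scoring is typically the cosine similarity between CLIP image embeddings of the reference screenshot and a rendering of the generated HTML; the paper's exact scoring pipeline is not reproduced here. A minimal sketch of that kind of check on a held-out set, using the Hugging Face transformers CLIP implementation and assuming the generated pages have already been rendered to PNG files (paths and checkpoint are placeholders), could look like this.

```python
# Minimal CLIP-similarity check between a reference screenshot and a rendering of
# the generated HTML. Assumes pages were already rendered to PNG (e.g. with a
# headless browser); file paths and the CLIP checkpoint are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(reference_png: str, generated_png: str) -> float:
    images = [Image.open(reference_png).convert("RGB"),
              Image.open(generated_png).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)     # unit-normalise embeddings
    return float(feats[0] @ feats[1])                    # cosine similarity

# scores = [clip_similarity(ref, gen) for ref, gen in held_out_pairs]  # aggregate per split
```

A sharp drop in the aggregate score on the complex held-out split, relative to Design2Code, would be the signal the question above is asking for.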
original abstract
UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence -- neither truly reduces prefill latency nor adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering a 9.1× time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UIPress, a lightweight learned compression module placed between the frozen ViT encoder and LLM decoder of Qwen3-VL-8B for UI-to-Code generation. It reduces ~6700 visual tokens to a fixed budget of 256 tokens using depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement, combined with LoRA on the decoder (adding ~21.7M trainable parameters, or 0.26% of the base model). On the Design2Code benchmark, UIPress reports a CLIP score of 0.8127, claiming +7.5% over the uncompressed baseline, +4.6% over the strongest inference-time baseline, and 9.1× time-to-first-token speedup, while asserting it is the first encoder-side learned compression method for this task.
Significance. If the reported gains can be isolated to the compression module under controlled conditions, the work would demonstrate a practical advance in token-efficient vision-language modeling for structured output generation from UI screenshots. The combination of encoder-side optical compression with minimal adaptation parameters offers a promising direction beyond inference-time heuristics, with clear efficiency benefits for prefill latency.
major comments (2)
- [Abstract / Experiments] Abstract and experimental comparison: The central claim of +7.5% CLIP improvement and 9.1× TTFT speedup over the uncompressed baseline on the same Qwen3-VL-8B model is load-bearing, yet the abstract does not specify whether this baseline receives the same LoRA fine-tuning applied to UIPress 'to bridge the representation gap.' If the baseline is the frozen model while UIPress is adapted, the gains cannot be attributed to the compression module (depthwise-separable convs + reweighting + Transformer) rather than adaptation.
- [Experiments] Experimental section: Concrete numbers (CLIP 0.8127, +7.5%, +4.6%, 9.1×) are presented without details on baseline implementations, exact data splits, variance across runs, or statistical tests. This omission prevents verification that the outperformance over four baselines is robust and isolates the proposed compression technique.
minor comments (2)
- [Abstract] Abstract: The 'to the best of our knowledge' claim of being the first encoder-side learned compression for UI-to-Code should be supported by a more explicit related-work comparison in the main text.
- [Abstract] Notation: Ensure consistent rendering of the approximate token count (~6,700) and parameter count throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where needed to improve the rigor and transparency of our experimental claims.
point-by-point responses
- Referee: [Abstract / Experiments] Abstract and experimental comparison: The central claim of +7.5% CLIP improvement and 9.1× TTFT speedup over the uncompressed baseline on the same Qwen3-VL-8B model is load-bearing, yet the abstract does not specify whether this baseline receives the same LoRA fine-tuning applied to UIPress 'to bridge the representation gap.' If the baseline is the frozen model while UIPress is adapted, the gains cannot be attributed to the compression module (depthwise-separable convs + reweighting + Transformer) rather than adaptation.
Authors: We appreciate this important observation on isolating the contribution of the compression module. The LoRA adaptation is introduced specifically to bridge the representation gap between the compressed visual tokens and the LLM decoder, as stated in the manuscript. The uncompressed baseline refers to the original Qwen3-VL-8B model in its standard pre-trained configuration without additional adaptation, which is the conventional reference point for demonstrating improvements from new modules. However, we acknowledge that the abstract and experimental description do not explicitly clarify this setup, which could lead to ambiguity in attributing gains. We will revise the abstract and the experimental section (including Table 1 and surrounding text) to explicitly describe the configuration of every baseline, including the uncompressed case. We will also add results for an uncompressed baseline fine-tuned with the identical LoRA setup to more cleanly isolate the effect of the proposed encoder-side compression. revision: yes
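To make the promised like-for-like ablation concrete, one common way to apply an identical LoRA configuration to both the compressed and uncompressed setups is via the peft library; the sketch below is not the paper's recipe, and the rank, alpha, dropout, and target-module names are placeholder assumptions, with `base_model` standing in for the already-loaded decoder.

```python
# Sketch of applying one shared LoRA configuration to two setups (with and without
# the compression module) so that any gain is attributable to compression alone.
# Rank, alpha, dropout, and target-module names are placeholder assumptions, and
# `base_model` stands in for the already-loaded decoder.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                                     # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

def add_lora(base_model):
    """Wrap a decoder with LoRA adapters; call identically for baseline and UIPress runs."""
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()                        # sanity-check the trainable budget
    return model
```

Reusing one configuration object for both runs is what makes the comparison attributable to the compression module rather than to differences in adaptation.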
- Referee: [Experiments] Experimental section: Concrete numbers (CLIP 0.8127, +7.5%, +4.6%, 9.1×) are presented without details on baseline implementations, exact data splits, variance across runs, or statistical tests. This omission prevents verification that the outperformance over four baselines is robust and isolates the proposed compression technique.
Authors: We agree that additional experimental details are necessary for reproducibility and verification. The full manuscript contains descriptions of the baselines (Section 4.2) and the Design2Code benchmark, but we will expand this section to provide: explicit implementation details for each of the four baselines (including any inference-time heuristics or token selection methods), the precise train/validation/test splits used, and any available run-to-run variance. Where multiple runs were performed, we will report standard deviations; for statistical significance, we will include appropriate tests (e.g., paired t-tests) on the CLIP scores. These additions will be incorporated into the revised experimental section and supplementary material. revision: yes
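One reasonable shape for the promised significance testing is a paired test over per-example CLIP scores, with each Design2Code example scored under both UIPress and a baseline; the sketch below uses scipy's paired t-test on synthetic placeholder scores, not the paper's results.

```python
# Paired significance test over per-example CLIP scores. The scores below are
# synthetic placeholders for illustration, not results from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline_scores = rng.normal(loc=0.78, scale=0.05, size=100)            # placeholder
uipress_scores = baseline_scores + rng.normal(0.03, 0.02, size=100)     # placeholder

t_stat, p_value = stats.ttest_rel(uipress_scores, baseline_scores)
print(f"mean delta = {(uipress_scores - baseline_scores).mean():+.4f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

A Wilcoxon signed-rank test is a common non-parametric alternative if the per-example deltas are far from normal.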
Circularity Check
No circularity; empirical results are measured outcomes, not derived by construction
full rationale
The paper proposes UIPress as an encoder-side compression module (depthwise-separable convs + reweighting + Transformer) inserted into Qwen3-VL-8B, with LoRA on the decoder, and reports measured CLIP scores (0.8127) and speedups (9.1×) on the Design2Code benchmark against baselines. The token budget of 256 is an explicit fixed design choice, not a fitted parameter whose value is then 'predicted' or forced to match the gains. No equations, self-citations, uniqueness theorems, or ansatzes are invoked to derive the central claims; the improvements are presented as direct empirical measurements. The evidential chain therefore rests on external benchmark measurements and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- token budget = 256
- added trainable parameters = 21.7M
axioms (2)
- domain assumption Freezing the ViT encoder while training only the compression module and LoRA suffices for effective adaptation.
- domain assumption CLIP score serves as a reliable proxy for the semantic and visual fidelity of generated UI code.
invented entities (1)
- UIPress compression module: no independent evidence