arxiv: 2511.08195 · v3 · submitted 2025-11-11 · 💻 cs.CV

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

Pith reviewed 2026-05-17 23:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords UI-to-code generation · visual optimization · reinforcement learning · iterative refinement · vision-language models · front-end code · relative policy optimization · rendered feedback

The pith

UI-to-code generation improves by treating it as a closed-loop visual optimization process rather than single-pass output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that real UI development relies on iteration and visual feedback, so UI-to-code should be recast as an interactive optimization task. Code is produced, rendered to an image, compared visually against the target, and refined in repeated cycles so that the rendering moves closer to the reference. Because visual quality is non-differentiable and absolute scores are noisy, the method guides updates with relative comparisons between pairs of rendered results. A 9B model trained this way reaches state-of-the-art scores on drafting, polishing, and editing benchmarks and keeps gaining accuracy with extra iterations, sometimes surpassing much larger models.

Core claim

UI-to-code generation can be reformulated as an interactive visual optimization problem in which code generation sits inside a closed loop of execution, visual inspection of the rendered interface, and iterative refinement driven by that visual feedback. Relative Visual Policy Optimization solves the non-differentiability and noise problems by learning from relative visual rankings among candidate renderings rather than absolute scores, allowing the model to improve steadily across multiple rounds.
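
The closed loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate`, `render`, and `similarity` are hypothetical callables standing in for the model, a browser renderer, and a visual evaluator, and the toy stand-ins operate on plain strings so the example runs without any of them. Keeping only the best candidate so far is a design choice made for this sketch.

```python
from typing import Callable, Optional, Tuple


def refine_loop(
    target: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    render: Callable[[str], str],
    similarity: Callable[[str, str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Draft once, then polish for `rounds` rounds; return the best code and its score."""
    # Round 0: draft from the target alone (no previous code or rendering).
    best_code = generate(target, None, None)
    best_score = similarity(target, render(best_code))
    for _ in range(rounds):
        # Polishing step: the generator sees the previous code and its rendering.
        candidate = generate(target, best_code, render(best_code))
        cand_score = similarity(target, render(candidate))
        # Assumption of this sketch: keep a candidate only if it renders closer to the target.
        if cand_score > best_score:
            best_code, best_score = candidate, cand_score
    return best_code, best_score


# Toy stand-ins (hypothetical): "code" is a string and rendering is the identity.
def toy_generate(target: str, prev_code: Optional[str], prev_render: Optional[str]) -> str:
    if prev_code is None:
        return target[:1]
    return target[: len(prev_code) + 1]  # each round recovers one more character


def toy_render(code: str) -> str:
    return code


def toy_similarity(target: str, rendered: str) -> float:
    matches = sum(1 for a, b in zip(target, rendered) if a == b)
    return matches / len(target)


if __name__ == "__main__":
    print(refine_loop("abcd", toy_generate, toy_render, toy_similarity, rounds=3))
```

With the toy stand-ins, each extra round recovers one more character of the target, mirroring the claim that quality rises with additional refinement rounds.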

What carries the argument

Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning procedure that ranks pairs of rendered UI outputs and updates the policy toward the visually preferred candidate under execution feedback.
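
To illustrate why relative comparisons can be more stable than absolute scores, the sketch below turns a group of evaluator scores into rank-based rewards. It is not the paper's RVPO objective, which is not spelled out on this page; the function name `relative_rewards`, the win-fraction formula, and the centering at zero are assumptions made for illustration.

```python
from typing import List


def relative_rewards(scores: List[float]) -> List[float]:
    """Map evaluator scores for a group of candidates to centered win fractions.

    Each candidate's reward is the fraction of other candidates it beats
    (ties count as half a win), shifted by 0.5 so rewards are centered at zero.
    Only the ordering of the scores matters, not their scale.
    """
    n = len(scores)
    if n < 2:
        return [0.0] * n
    rewards = []
    for i, s_i in enumerate(scores):
        wins = 0.0
        for j, s_j in enumerate(scores):
            if i == j:
                continue
            if s_i > s_j:
                wins += 1.0
            elif s_i == s_j:
                wins += 0.5
        rewards.append(wins / (n - 1) - 0.5)
    return rewards


if __name__ == "__main__":
    raw = [10.0, 50.0, 90.0]            # hypothetical evaluator scores
    distorted = [s ** 2 for s in raw]   # a monotone distortion of the same scores
    print(relative_rewards(raw))
    print(relative_rewards(distorted))
```

Any monotone distortion of the evaluator's scale leaves these rewards unchanged, which is the sense in which a ranking signal sidesteps miscalibrated absolute scores. It does not help when noise flips the ordering itself.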

If this is right

  • Performance on UI drafting, polishing, and editing tasks rises steadily with additional rounds of visual inspection and refinement.
  • A 9B model trained with this loop can exceed the results of larger single-pass models on the same benchmarks.
  • The same optimization loop applies equally to starting from a screenshot, improving existing code, or making targeted edits.
  • The open-source model and training recipe make the iterative refinement process reproducible for other front-end tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relative-ranking loop could be tested on generating interactive web components where functional behavior rather than static appearance provides the feedback signal.
  • Pairing RVPO with stronger vision encoders might accelerate convergence and reduce the number of iterations needed.
  • Applying the method to mobile or desktop app layouts could reveal whether the visual-ranking signal generalizes beyond web screenshots.

Load-bearing premise

Rendered visual feedback can supply a consistent training signal for refinement even though visual quality is non-differentiable and absolute evaluators are noisy.

What would settle it

If multiple rounds of visual optimization yield no measurable benchmark gain over a single-pass baseline on the same UI drafting or editing tasks, the claimed benefit of the iterative loop would be refuted.

Figures

Figures reproduced from arXiv: 2511.08195 by Jiale Cheng, Jie Tang, Mingde Xu, Weihan Wang, Wenyi Hong, Xiaotao Gu, Xinyue Fan, Zhen Yang.

Figure 1: Top: Comparison of UI-to-code generation outputs from leading models versus our model, using the same reference screenshot. Our model achieves the highest fidelity, further enhanced by our UI polishing capability. Additional qualitative examples with diverse content, aspect ratios, and layouts are provided in Appendix A.5. Bottom left: Performance comparison on UI-to-code and UI polishing tasks. Bottom rig…
Figure 2: Our interactive UI-to-code paradigm integrates UI-to-code, UI polishing, and UI editing.
Figure 3: UI2Code^N Demo Cases: UI-to-code (1/4)
Figure 4: UI2Code^N Demo Cases: UI-to-code (2/4)
Figure 5: UI2Code^N Demo Cases: UI-to-code (3/4)
Figure 6: UI2Code^N Demo Cases: UI-to-code (4/4)
Figure 7: UI2Code^N Demo Cases: UI Editing (1/2)
Figure 8: UI2Code^N Demo Cases: UI Editing (2/2)

Original abstract

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI development that is inherently iterative and feedback-driven. We reformulate UI-to-code as an interactive visual optimization problem, where code generation is embedded in a closed-loop process of execution, visual inspection, and iterative refinement driven by rendered visual feedback. To address the non-differentiability of visual objectives and the noise of absolute visual evaluators, we propose Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning method that optimizes relative visual rankings among rendered candidates under execution feedback. We instantiate this paradigm in UI2Code^N, an open-source 9B model trained via continual pre-training, supervised fine-tuning, and reinforcement learning. Experiments demonstrate state-of-the-art performance on UI drafting, UI polishing, and UI editing benchmarks, even outperforming larger models, with performance consistently improving through iterative visual optimization. Our code and models are available at https://github.com/zai-org/UI2Code_N.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reformulates UI-to-code generation as an interactive visual optimization problem in a closed-loop process involving code execution, rendered visual inspection, and iterative refinement. It introduces Relative Visual Policy Optimization (RVPO), a preference-based RL approach that optimizes relative visual rankings among candidate renderings to handle non-differentiability and noise in absolute visual evaluators. The authors present UI2Code^N, a 9B model trained via continual pre-training, supervised fine-tuning, and RL, claiming state-of-the-art results on UI drafting, polishing, and editing benchmarks that outperform larger models, with performance improving consistently across iterations.

Significance. If the results hold, the work has moderate significance by shifting UI-to-code from single-pass generation to a more realistic iterative, feedback-driven paradigm that aligns with real development practices. The open-source release of the 9B model and code is a clear strength, as is the attempt to apply preference optimization to visual feedback. However, the significance hinges on whether the reported gains are attributable to the specific RVPO mechanism rather than the closed-loop setup or base model scale.

major comments (3)
  1. [§4] §4 (Experiments): The central claim of consistent improvement via iterative visual optimization and SOTA performance lacks ablations comparing RVPO directly to absolute-score RL or non-RL iterative baselines; without these, it is unclear whether relative rankings specifically mitigate noise better than alternatives, which is load-bearing for the method's contribution.
  2. [Table 2] Table 2 (UI editing results): Reported outperformance over larger models is presented without error bars, run counts, or statistical tests, making it difficult to verify the robustness of the 'consistently improving' assertion across iterations.
  3. [§3.2] §3.2 (RVPO formulation): The description of how relative visual rankings are derived from rendered feedback and execution does not include sufficient detail on preference data collection or ranking quality, which is necessary to evaluate the claim that this addresses non-differentiability and evaluator noise.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly reference prior iterative UI generation works to better situate the novelty of the closed-loop formulation.
  2. [Figure 1] Figure captions for the optimization loop diagram are somewhat terse and would benefit from additional labels explaining the RVPO preference step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The feedback highlights important areas for strengthening the experimental validation and methodological clarity of our work on RVPO and the iterative visual optimization paradigm. We address each major comment point by point below and have made corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim of consistent improvement via iterative visual optimization and SOTA performance lacks ablations comparing RVPO directly to absolute-score RL or non-RL iterative baselines; without these, it is unclear whether relative rankings specifically mitigate noise better than alternatives, which is load-bearing for the method's contribution.

    Authors: We agree that isolating the contribution of relative rankings is essential. In the revised manuscript, we have added a new ablation study in §4 that directly compares RVPO against an absolute-score RL baseline (using scalar visual rewards from the evaluator) and non-RL iterative baselines (repeated generation with visual selection but no policy update). Results show RVPO yields larger and more stable gains across iterations due to better handling of evaluator noise, with a new Table 4 summarizing these comparisons and supporting discussion. revision: yes

  2. Referee: [Table 2] Table 2 (UI editing results): Reported outperformance over larger models is presented without error bars, run counts, or statistical tests, making it difficult to verify the robustness of the 'consistently improving' assertion across iterations.

    Authors: We acknowledge the importance of statistical rigor for the iterative improvement claims. We have rerun the UI editing experiments across 5 independent random seeds and updated Table 2 to report means with standard deviations. We also added paired t-test p-values between iterations, confirming statistically significant improvements (p < 0.05) that support the robustness of consistent gains. revision: yes

  3. Referee: [§3.2] §3.2 (RVPO formulation): The description of how relative visual rankings are derived from rendered feedback and execution does not include sufficient detail on preference data collection or ranking quality, which is necessary to evaluate the claim that this addresses non-differentiability and evaluator noise.

    Authors: We have expanded §3.2 with additional details on the preference data pipeline. The revision describes how multiple code candidates are executed to produce renderings, how a visual critic generates pairwise preferences based on visual alignment to the target UI, and the quality filters applied (e.g., discarding low-confidence pairs). We added pseudocode and discussion explaining why relative rankings are more robust to noise and non-differentiability than absolute scores, along with a new illustrative figure. revision: yes
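
The preference pipeline outlined in the third response (score candidate renderings, form pairwise preferences, discard low-confidence pairs) can be sketched as follows. This is an illustrative assumption, not the authors' pseudocode: the function name `build_preference_pairs`, the use of a score margin as the confidence proxy, and the threshold value are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_preference_pairs(
    critic_scores: Dict[str, float],
    min_margin: float = 5.0,
) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) candidate pairs from per-candidate critic scores.

    Pairs whose score gap is below `min_margin` are treated as low-confidence
    and dropped. Using the gap as a confidence proxy is an assumption of this sketch.
    """
    pairs = []
    for a, b in combinations(sorted(critic_scores), 2):
        margin = critic_scores[a] - critic_scores[b]
        if abs(margin) < min_margin:
            continue  # too close to call: discard the pair
        pairs.append((a, b) if margin > 0 else (b, a))
    return pairs


if __name__ == "__main__":
    # Hypothetical critic scores for three rendered candidates.
    scores = {"cand_a": 90.0, "cand_b": 70.0, "cand_c": 88.0}
    print(build_preference_pairs(scores))
```

In this example the pair (`cand_a`, `cand_c`) is dropped because the two scores differ by less than the margin, while both remaining pairs prefer the higher-scoring candidate over `cand_b`.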

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with independent method proposal and empirical results

full rationale

The paper reformulates UI-to-code generation as an interactive visual optimization problem and introduces RVPO as a novel preference-based RL approach to optimize relative visual rankings under execution feedback. Training proceeds via standard stages of continual pre-training, supervised fine-tuning, and reinforcement learning on a 9B model, with claimed SOTA results on drafting, polishing, and editing benchmarks plus iterative improvement. No step reduces a claimed prediction or result to its own inputs by construction (e.g., no fitted parameters renamed as predictions, no self-definitional loops in the optimization objective, and no load-bearing self-citations that substitute for external verification). The central claims rest on the proposed RVPO mechanism and closed-loop setup rather than tautological redefinitions, making the derivation independent of the target outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that visual rendering feedback provides useful training signal and that relative ranking in RL can overcome evaluator noise; RVPO is introduced as a new component without independent prior evidence.

axioms (1)
  • domain assumption: Vision-language models can produce executable front-end code from UI screenshots as a starting point
    Invoked in the setup of the UI-to-code task and the closed-loop process.
invented entities (1)
  • Relative Visual Policy Optimization (RVPO): no independent evidence
    purpose: Preference-based RL to optimize relative visual rankings among rendered code candidates under execution feedback
    New method proposed to address non-differentiability and noise in visual objectives.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    cs.CL · 2026-02 · unverdicted · novelty 7.0

    Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
