Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Haoyu Dong

arxiv: 2606.10334 · v1 · pith:5UG4EV5Hnew · submitted 2026-06-09 · 💻 cs.AI

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Haoyu Dong This is my paper

Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords self-distillationpolicy optimizationvisual feedbackcode generationvisual artifactsgrounded credit weightingchart generationUI generation

0 comments

The pith

Visual feedback from rendered outputs lets code-generating LLMs self-distill corrections to defects via grounded credit weighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code-generating LLMs commit to programs that produce charts, web pages, and slides before seeing the rendered results, so many outputs contain visually obvious defects such as overlaps, clipped text, or broken alignment. The paper proposes Visual-SDPO, a self-distillation policy optimization method that feeds the rendered visual output back to a weight-sharing teacher model and distills targeted corrections into the student coding policy. A Visual-Grounded Code Credit Weighting step traces each defect to the responsible code statements and amplifies the learning signal on those tokens rather than spreading it uniformly. A GRPO term rewards high-quality executable rollouts while execution failures stay learnable by routing error messages through the teacher path. The resulting model improves primary metrics by more than 10 points over the zero-shot baseline and at least 2.4 points over GRPO on ChartMimic, Design2Code, and AeSlides while using fewer training steps and no extra inference cost.

Core claim

Visual-SDPO treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student; Visual-Grounded Code Credit Weighting traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements; a sequence-level GRPO term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher.

What carries the argument

Visual-Grounded Code Credit Weighting traces each detected defect in the rendered output back to the responsible code statements and amplifies the distillation signal on those statements.

If this is right

Gains exceed 10 absolute points over zero-shot baselines on the primary metric across ChartMimic, Design2Code, and AeSlides.
Gains exceed 2.4 points over GRPO while requiring fewer training steps and adding no inference-time cost.
The same framework applies without modification to chart-to-code, UI-to-code, and slide-generation tasks using one backbone model.
Execution failures stay usable for learning because error messages are passed as privileged context to the teacher.
The dense token-level distillation objective is complemented by a sequence-level GRPO reward on visual quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tracing mechanism could be applied to other code-to-output pipelines where defects in a non-differentiable renderer need to be attributed to source statements.
Self-distillation from visual feedback may reduce reliance on external reward models or human annotations for visual code generation tasks.
The approach suggests that weight-sharing teacher-student setups can transfer information across the code-to-render boundary more efficiently than standard policy gradients alone.

Load-bearing premise

Visual defects detected in the rendered output can be reliably traced back to the responsible code statements so that the credit weighting produces a correct rather than noisy distillation signal.

What would settle it

An ablation that replaces the defect-tracing step with random or uniform credit assignment and still obtains the full reported gains on ChartMimic would show that accurate tracing is not required for the claimed improvement.

Figures

Figures reproduced from arXiv: 2606.10334 by Haoyu Dong.

**Figure 1.** Figure 1: Visual-SDPO overview. The student LLM generates a code rollout (shown as a simplified token strip, top center); the renderer (matplotlib, Playwright, or python-pptx) produces a rendered artifact. A defect detector returns regions around visual problems (legend overlapping a data line in the example shown), and a rubric extractor produces a structured binary defect table. Both feed the teacher LLM, which sh… view at source ↗

read the original abstract

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Visual-SDPO adds grounded credit weighting to visual-feedback self-distillation for code gen, but the tracing step that makes it work has no reported validation.

read the letter

The new piece here is Visual-Grounded Code Credit Weighting: it tries to map visual defects in the rendered output back to the exact code statements that caused them, then boosts the distillation signal on those tokens. They pair this with a sequence-level GRPO term and pass execution errors as extra context to the teacher. The backbone is Qwen3-VL-8B, tested on ChartMimic, Design2Code, and AeSlides.

The approach is straightforward and targets a real pain point—executable code that still looks broken. Reporting gains of more than 10 points over zero-shot and at least 2.4 over plain GRPO, with fewer steps and no inference cost, is the kind of practical result that matters in this sub-area.

The main weakness is exactly the one the stress-test note flags. The whole method stands on the assumption that visual defects can be traced reliably to the responsible code lines. The abstract gives no ablation, no precision numbers on the tracer, and no human agreement check. If the attribution is often wrong, the token-level objective just becomes noisier supervision, and the observed lift could come from the self-distillation path or GRPO alone. Without that check, the central claim is hard to evaluate.

This is for groups already working on code-to-visual pipelines or UI generation. A reader who cares about targeted credit assignment in non-differentiable renderers will find the idea worth testing. The paper deserves peer review because it ships a distinct mechanism and concrete benchmark numbers, even though the tracing reliability needs direct evidence before the gains can be trusted.

Referee Report

2 major / 1 minor

Summary. The paper proposes Visual-SDPO, a self-distillation policy-optimization framework for code-generating LLMs that produces visual artifacts (charts, web pages, slides). Rendered visual feedback serves as privileged context for a weight-sharing teacher that distills into the student; Visual-Grounded Code Credit Weighting traces detected defects (overlap, clipping, etc.) back to responsible code statements to amplify the token-level distillation signal on those statements. A complementary sequence-level GRPO term rewards executable high-quality rollouts. The method is instantiated on a Qwen3-VL-8B backbone and is reported to yield >10 absolute-point gains over the zero-shot base and at least 2.4 points over GRPO on ChartMimic, Design2Code, and AeSlides, with fewer training steps and no inference-time overhead.

Significance. If the empirical gains and the correctness of the defect-to-code attribution hold, the work would demonstrate a practical route to improving visual quality of LLM-generated artifacts via self-distillation that exploits non-differentiable renderers without adding inference cost. The combination of dense token-level credit assignment with sequence-level GRPO is a potentially reusable pattern for other code-to-visual tasks.

major comments (2)

[Method section on Visual-Grounded Code Credit Weighting] The section describing Visual-Grounded Code Credit Weighting provides no ablation, precision/recall metric on the tracing step, or human agreement study that quantifies how reliably visual defects are mapped back to the exact code statements that produced the affected elements. Because the central claim attributes the gains over GRPO to this targeted credit weighting, the absence of any direct validation of attribution accuracy leaves open the possibility that observed improvements arise from the sequence-level GRPO term or the self-distillation path alone.
[Experiments and results section] The experimental section (and abstract) reports concrete numeric gains (>10 points over zero-shot, ≥2.4 over GRPO) but supplies no definition of the primary metric, baseline implementations, number of runs, statistical significance tests, or ablation results that isolate the contribution of Visual-Grounded Code Credit Weighting. Without these, the load-bearing empirical claim cannot be assessed.

minor comments (1)

Clarify whether the tracing procedure is deterministic or involves any learned components, and state the exact failure modes (e.g., overlapping elements that cannot be unambiguously attributed) that are handled by the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses

Referee: [Method section on Visual-Grounded Code Credit Weighting] The section describing Visual-Grounded Code Credit Weighting provides no ablation, precision/recall metric on the tracing step, or human agreement study that quantifies how reliably visual defects are mapped back to the exact code statements that produced the affected elements. Because the central claim attributes the gains over GRPO to this targeted credit weighting, the absence of any direct validation of attribution accuracy leaves open the possibility that observed improvements arise from the sequence-level GRPO term or the self-distillation path alone.

Authors: We acknowledge that the current manuscript lacks direct quantitative validation of the attribution accuracy in Visual-Grounded Code Credit Weighting. In the revision we will add an ablation that removes this component while retaining GRPO and self-distillation, precision/recall metrics computed against manually annotated defect-to-statement mappings on a held-out subset, and a human agreement study reporting inter-annotator agreement on the tracing step. revision: yes
Referee: [Experiments and results section] The experimental section (and abstract) reports concrete numeric gains (>10 points over zero-shot, ≥2.4 over GRPO) but supplies no definition of the primary metric, baseline implementations, number of runs, statistical significance tests, or ablation results that isolate the contribution of Visual-Grounded Code Credit Weighting. Without these, the load-bearing empirical claim cannot be assessed.

Authors: We agree the experimental reporting requires additional detail. The revision will explicitly define the primary metric, describe baseline implementations, report means and standard deviations over multiple random seeds together with statistical significance tests, and present ablations that isolate the contribution of Visual-Grounded Code Credit Weighting from the GRPO and self-distillation terms. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with external benchmarks

full rationale

The paper describes Visual-SDPO and Visual-Grounded Code Credit Weighting as a new framework, then reports absolute improvements on ChartMimic, Design2Code, and AeSlides benchmarks. No equations, derivations, or self-referential reductions appear in the provided text; the central claims rest on benchmark comparisons rather than any fitted quantity or self-citation chain being renamed as a prediction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; standard RL hyperparameters and benchmark assumptions are implicit but unspecified.

pith-pipeline@v0.9.1-grok · 5821 in / 1062 out tokens · 34483 ms · 2026-06-27T13:40:35.001390+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 29 canonical work pages · 8 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes.arXiv preprint arXiv:2306.13649,

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: learning from self- generated mistakes. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2306.13649

work page arXiv 2024
[2]

pix2code: Generating Code from a Graphical User Interface Screenshot

Tony Beltramelli. pix2code: generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS), 2018. arXiv:1705.07962

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

Breaking the SFT plateau: multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. Breaking the SFT plateau: multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025

work page arXiv 2025
[4]

ChartEditor: a reinforcement learning framework for robust chart editing.arXiv preprint arXiv:2511.15266, 2025

Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang, Liang Zhang, Wenxuan Wang, and Qin Jin. ChartEditor: a reinforcement learning framework for robust chart editing.arXiv preprint arXiv:2511.15266, 2025

work page arXiv 2025
[5]

Learning only with images: Visual rein- forcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. Learning only with images: visual reinforcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

work page arXiv 2025
[6]

DesignCoder: hierarchy-aware and self-correcting UI code generation with large language models.arXiv preprint arXiv:2506.13663, 2025

Yunnong Chen, Shixian Ding, YingYing Zhang, Wenkai Chen, Jinzhou Du, Lingyun Sun, and Liuqing Chen. DesignCoder: hierarchy-aware and self-correcting UI code generation with large language models.arXiv preprint arXiv:2506.13663, 2025

work page arXiv 2025
[7]

Hdpo: Hybrid distillation policy optimization via privileged self-distillation.arXiv preprint arXiv:2603.23871,

Ken Ding. HDPO: hybrid distillation policy optimization via privileged self-distillation.arXiv preprint arXiv:2603.23871, 2026

work page arXiv 2026
[8]

DW ARF debugging information format, version 5.https://dwarfstd.org, 2017

DW ARF Debugging Information Format Committee. DW ARF debugging information format, version 5.https://dwarfstd.org, 2017

2017
[9]

Yi Gui, Yao Wan, Zhen Li, Zhongyi Zhang, Dongping Chen, Hongyu Zhang, Yi Su, Bohua Chen, Xing Zhou, Wenbin Jiang, and 1 others

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, and Xiangliang Zhang. WebCode2M: a real- world dataset for code generation from webpage designs. InProceedings of the ACM Web Conference (WWW), pages 1834–1845, 2025. arXiv:2404.06369

work page arXiv 2025
[10]

Visual debugging techniques for reactive data visualization.Computer Graphics Forum, 35(3):271–280, 2016

Jane Hoffswell, Arvind Satyanarayan, and Jeffrey Heer. Visual debugging techniques for reactive data visualization.Computer Graphics Forum, 35(3):271–280, 2016. EuroVis 2016

2016
[11]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Ryan Li, Yanzhe Zhang, and Diyi Yang

Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, and Xiangyu Yue. ScreenCoder: advancing visual-to-code generation for front-end automation via modular multimodal agents.arXiv preprint arXiv:2507.22827, 2025

work page arXiv 2025
[13]

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, and Linfeng Zhang. Respecting self- uncertainty in on-policy self-distillation for efficient LLM reasoning.arXiv preprint arXiv:2605.13255, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

Ko and Brad A

Andrew J. Ko and Brad A. Myers. Designing the Whyline: a debugging interface for asking questions about program behavior. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), pages 151–158, 2004

2004
[15]

Unlocking the conversion of web screenshots into html code with the websight dataset.arXiv preprint arXiv:2403.09029,

Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset.arXiv preprint arXiv:2403.09029, 2024. 9

work page arXiv 2024
[16]

Pix2Struct: screenshot parsing as pretraining for visual language understanding

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisensch- los, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: screenshot parsing as pretraining for visual language understanding. InInternational Conference on Machine Learning (ICML), pages 18893–18912, 2023. arXiv:2210.03347

work page arXiv 2023
[17]

Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):1–15, 2020b

Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, and Bo Zhou. ReLook: vision-grounded RL with a multimodal LLM critic for agentic web coding.arXiv preprint arXiv:2510.11498, 2025

work page arXiv 2025
[18]

Unifying distillation and privileged information

David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. InInternational Conference on Learning Representations (ICLR),
[19]

Nsight Graphics shader debugger

NVIDIA. Nsight Graphics shader debugger. https://developer.nvidia.com/ nsight-graphics, 2025

2025
[20]

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Yiming Pan, Chengwei Hu, Xuancheng Huang, Can Huang, Mingming Zhao, Yuean Bi, Xiaohan Zhang, Aohan Zeng, and Linmei Hu. AeSlides: incentivizing aesthetic layout in LLM-based slide generation via verifiable rewards.arXiv preprint arXiv:2604.22840, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Paper2Poster: towards multimodal poster automation from scientific papers.arXiv preprint arXiv:2505.21497, 2025

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, and Philip Torr. Paper2Poster: towards multimodal poster automation from scientific papers.arXiv preprint arXiv:2505.21497, 2025

work page arXiv 2025
[22]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel

James K. Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel. torch.fx: practical program capture and transformation for deep learning in Python. InProceedings of Machine Learning and Systems (MLSys), 2022. arXiv:2112.08429

work page arXiv 2022
[24]

How do I inspect a pixel value? https://renderdoc.org/docs/how/how_ inspect_pixel.html, 2025

RenderDoc. How do I inspect a pixel value? https://renderdoc.org/docs/how/how_ inspect_pixel.html, 2025

2025
[25]

Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wich- mann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, and Marco Pedersoli. Rendering-aware reinforcement learning for vector graphics generation.arXiv preprint arXiv:2505.20793, 2025

work page arXiv 2025
[26]

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, and Xing Yu. Anti-self-distillation for reasoning RL via pointwise mutual information.arXiv preprint arXiv:2605.11609, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Design2Code: benchmarking multimodal code generation for automated front-end engineering.arXiv preprint arXiv:2403.03163, 2024

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: benchmarking multimodal code generation for automated front-end engineering.arXiv preprint arXiv:2403.03163, 2024

work page arXiv 2024
[28]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Gates: Self-distillation under privileged context with consensus gating.arXiv preprint arXiv:2602.20574,

Alex Stein, Furong Huang, and Tom Goldstein. GATES: self-distillation under privileged context with consensus gating.arXiv preprint arXiv:2602.20574, 2026

work page arXiv 2026
[30]

ChartMas- ter: advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning.arXiv preprint arXiv:2508.17608, 2025

Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, and Xiaodong He. ChartMas- ter: advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning.arXiv preprint arXiv:2508.17608, 2025

work page arXiv 2025
[31]

Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael R. Lyu. Divide-and-conquer: generating UI code from screenshots.Proceedings of the ACM on Software Engineering, 2(FSE):2099–2122, 2025. arXiv:2406.16386

work page arXiv 2099
[32]

Mark D. Weiser. Program slicing. InProceedings of the 5th International Conference on Software Engineering (ICSE), pages 439–449, 1981. 10

1981
[33]

arXiv preprint arXiv:2406.09961 , year=

Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, and Yujiu Yang. Chart- Mimic: evaluating LMM’s cross-modal reasoning capability via chart-to-code generation. In International Conference on Learning Representations (ICLR), 2025. arXiv:2406.09961

work page arXiv 2025
[34]

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, and Jie Tang. UI2CodeN: UI-to-code generation as interactive visual optimization.arXiv preprint arXiv:2511.08195, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Xing, Xiaodan Liang, and Zhiqiang Shen

Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, and Zhiqiang Shen. Web2Code: a large-scale webpage-to-code dataset and evaluation framework for multimodal LLMs. In Advances in N...

work page arXiv 2024
[36]

VinciCoder: unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. VinciCoder: unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

work page arXiv 2025
[37]

ChartCoder: advancing multimodal large language model for chart-to-code generation

Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ChartCoder: advancing multimodal large language model for chart-to-code generation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 7333–7348,
[38]

PPTAgent: generating and evaluating presentations beyond text-to-slides

Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. PPTAgent: generating and evaluating presentations beyond text-to-slides. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 14402–14418, 2025. 11

2025

[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes.arXiv preprint arXiv:2306.13649,

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: learning from self- generated mistakes. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2306.13649

work page arXiv 2024

[2] [2]

pix2code: Generating Code from a Graphical User Interface Screenshot

Tony Beltramelli. pix2code: generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS), 2018. arXiv:1705.07962

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Breaking the SFT plateau: multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. Breaking the SFT plateau: multimodal structured reinforcement learning for chart-to-code generation.arXiv preprint arXiv:2508.13587, 2025

work page arXiv 2025

[4] [4]

ChartEditor: a reinforcement learning framework for robust chart editing.arXiv preprint arXiv:2511.15266, 2025

Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang, Liang Zhang, Wenxuan Wang, and Qin Jin. ChartEditor: a reinforcement learning framework for robust chart editing.arXiv preprint arXiv:2511.15266, 2025

work page arXiv 2025

[5] [5]

Learning only with images: Visual rein- forcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. Learning only with images: visual reinforcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

work page arXiv 2025

[6] [6]

DesignCoder: hierarchy-aware and self-correcting UI code generation with large language models.arXiv preprint arXiv:2506.13663, 2025

Yunnong Chen, Shixian Ding, YingYing Zhang, Wenkai Chen, Jinzhou Du, Lingyun Sun, and Liuqing Chen. DesignCoder: hierarchy-aware and self-correcting UI code generation with large language models.arXiv preprint arXiv:2506.13663, 2025

work page arXiv 2025

[7] [7]

Hdpo: Hybrid distillation policy optimization via privileged self-distillation.arXiv preprint arXiv:2603.23871,

Ken Ding. HDPO: hybrid distillation policy optimization via privileged self-distillation.arXiv preprint arXiv:2603.23871, 2026

work page arXiv 2026

[8] [8]

DW ARF debugging information format, version 5.https://dwarfstd.org, 2017

DW ARF Debugging Information Format Committee. DW ARF debugging information format, version 5.https://dwarfstd.org, 2017

2017

[9] [9]

Yi Gui, Yao Wan, Zhen Li, Zhongyi Zhang, Dongping Chen, Hongyu Zhang, Yi Su, Bohua Chen, Xing Zhou, Wenbin Jiang, and 1 others

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, and Xiangliang Zhang. WebCode2M: a real- world dataset for code generation from webpage designs. InProceedings of the ACM Web Conference (WWW), pages 1834–1845, 2025. arXiv:2404.06369

work page arXiv 2025

[10] [10]

Visual debugging techniques for reactive data visualization.Computer Graphics Forum, 35(3):271–280, 2016

Jane Hoffswell, Arvind Satyanarayan, and Jeffrey Heer. Visual debugging techniques for reactive data visualization.Computer Graphics Forum, 35(3):271–280, 2016. EuroVis 2016

2016

[11] [11]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Ryan Li, Yanzhe Zhang, and Diyi Yang

Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, and Xiangyu Yue. ScreenCoder: advancing visual-to-code generation for front-end automation via modular multimodal agents.arXiv preprint arXiv:2507.22827, 2025

work page arXiv 2025

[13] [13]

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, and Linfeng Zhang. Respecting self- uncertainty in on-policy self-distillation for efficient LLM reasoning.arXiv preprint arXiv:2605.13255, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] [14]

Ko and Brad A

Andrew J. Ko and Brad A. Myers. Designing the Whyline: a debugging interface for asking questions about program behavior. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), pages 151–158, 2004

2004

[15] [15]

Unlocking the conversion of web screenshots into html code with the websight dataset.arXiv preprint arXiv:2403.09029,

Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset.arXiv preprint arXiv:2403.09029, 2024. 9

work page arXiv 2024

[16] [16]

Pix2Struct: screenshot parsing as pretraining for visual language understanding

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisensch- los, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: screenshot parsing as pretraining for visual language understanding. InInternational Conference on Machine Learning (ICML), pages 18893–18912, 2023. arXiv:2210.03347

work page arXiv 2023

[17] [17]

Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):1–15, 2020b

Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, and Bo Zhou. ReLook: vision-grounded RL with a multimodal LLM critic for agentic web coding.arXiv preprint arXiv:2510.11498, 2025

work page arXiv 2025

[18] [18]

Unifying distillation and privileged information

David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. InInternational Conference on Learning Representations (ICLR),

[19] [19]

Nsight Graphics shader debugger

NVIDIA. Nsight Graphics shader debugger. https://developer.nvidia.com/ nsight-graphics, 2025

2025

[20] [20]

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Yiming Pan, Chengwei Hu, Xuancheng Huang, Can Huang, Mingming Zhao, Yuean Bi, Xiaohan Zhang, Aohan Zeng, and Linmei Hu. AeSlides: incentivizing aesthetic layout in LLM-based slide generation via verifiable rewards.arXiv preprint arXiv:2604.22840, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Paper2Poster: towards multimodal poster automation from scientific papers.arXiv preprint arXiv:2505.21497, 2025

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, and Philip Torr. Paper2Poster: towards multimodal poster automation from scientific papers.arXiv preprint arXiv:2505.21497, 2025

work page arXiv 2025

[22] [22]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel

James K. Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel. torch.fx: practical program capture and transformation for deep learning in Python. InProceedings of Machine Learning and Systems (MLSys), 2022. arXiv:2112.08429

work page arXiv 2022

[24] [24]

How do I inspect a pixel value? https://renderdoc.org/docs/how/how_ inspect_pixel.html, 2025

RenderDoc. How do I inspect a pixel value? https://renderdoc.org/docs/how/how_ inspect_pixel.html, 2025

2025

[25] [25]

Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wich- mann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, and Marco Pedersoli. Rendering-aware reinforcement learning for vector graphics generation.arXiv preprint arXiv:2505.20793, 2025

work page arXiv 2025

[26] [26]

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, and Xing Yu. Anti-self-distillation for reasoning RL via pointwise mutual information.arXiv preprint arXiv:2605.11609, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

Design2Code: benchmarking multimodal code generation for automated front-end engineering.arXiv preprint arXiv:2403.03163, 2024

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: benchmarking multimodal code generation for automated front-end engineering.arXiv preprint arXiv:2403.03163, 2024

work page arXiv 2024

[28] [28]

A Survey of On-Policy Distillation for Large Language Models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Gates: Self-distillation under privileged context with consensus gating.arXiv preprint arXiv:2602.20574,

Alex Stein, Furong Huang, and Tom Goldstein. GATES: self-distillation under privileged context with consensus gating.arXiv preprint arXiv:2602.20574, 2026

work page arXiv 2026

[30] [30]

ChartMas- ter: advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning.arXiv preprint arXiv:2508.17608, 2025

Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, and Xiaodong He. ChartMas- ter: advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning.arXiv preprint arXiv:2508.17608, 2025

work page arXiv 2025

[31] [31]

Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael R. Lyu. Divide-and-conquer: generating UI code from screenshots.Proceedings of the ACM on Software Engineering, 2(FSE):2099–2122, 2025. arXiv:2406.16386

work page arXiv 2099

[32] [32]

Mark D. Weiser. Program slicing. InProceedings of the 5th International Conference on Software Engineering (ICSE), pages 439–449, 1981. 10

1981

[33] [33]

arXiv preprint arXiv:2406.09961 , year=

Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, and Yujiu Yang. Chart- Mimic: evaluating LMM’s cross-modal reasoning capability via chart-to-code generation. In International Conference on Learning Representations (ICLR), 2025. arXiv:2406.09961

work page arXiv 2025

[34] [34]

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, and Jie Tang. UI2CodeN: UI-to-code generation as interactive visual optimization.arXiv preprint arXiv:2511.08195, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Xing, Xiaodan Liang, and Zhiqiang Shen

Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, and Zhiqiang Shen. Web2Code: a large-scale webpage-to-code dataset and evaluation framework for multimodal LLMs. In Advances in N...

work page arXiv 2024

[36] [36]

VinciCoder: unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. VinciCoder: unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

work page arXiv 2025

[37] [37]

ChartCoder: advancing multimodal large language model for chart-to-code generation

Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ChartCoder: advancing multimodal large language model for chart-to-code generation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 7333–7348,

[38] [38]

PPTAgent: generating and evaluating presentations beyond text-to-slides

Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. PPTAgent: generating and evaluating presentations beyond text-to-slides. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 14402–14418, 2025. 11

2025