
arxiv: 2606.00750 · v1 · pith:W42XYPQQnew · submitted 2026-05-30 · 💻 cs.CL

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications

Pith reviewed 2026-06-28 18:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords: interactive web systems, LLM agents, scientific paper understanding, benchmark, webpage synthesis, mechanism modeling, PaperVoyager

The pith

The PaperVoyager agent converts research papers into executable interactive web systems by explicitly modeling mechanisms and interaction logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an agent that takes a PDF research paper as input and produces a fully functional interactive web application, allowing users to adjust inputs and directly observe resulting changes in system behavior. It establishes a benchmark consisting of 19 research papers each paired with an expert-constructed interactive system that serves as ground truth. The authors introduce PaperVoyager as a structured framework that breaks the generation process into explicit steps for understanding the paper, modeling its mechanisms, and synthesizing the webpage. Experiments on this benchmark show the structured approach produces interactive systems of measurably higher quality than those generated without explicit mechanism modeling.

Core claim

The paper claims that an end-to-end Paper-to-Interactive-System Agent can transform a research paper PDF into an executable interactive webpage without human intervention, and that the PaperVoyager framework, which explicitly models mechanisms and interaction logic during synthesis, produces higher-quality outputs than unstructured generation methods when evaluated against expert-built ground-truth systems on a benchmark of 19 papers.

What carries the argument

The Paper-to-Interactive-System Agent performing paper understanding, system modeling, and interactive webpage synthesis in sequence, with PaperVoyager as the structured framework that enforces explicit modeling of mechanisms and interaction logic.

If this is right

  • Users gain the ability to manipulate inputs in the generated systems and observe corresponding dynamic behaviors and state transitions.
  • Explicit modeling of mechanisms and interaction logic during generation leads to higher quality interactive systems than direct generation approaches.
  • The method supplies a new paradigm in which scientific papers are understood through direct interaction rather than static summaries or documents.
  • The 19-paper benchmark enables systematic measurement of an agent's capacity to capture dynamic aspects of technical content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structured modeling steps could be tested on technical documents outside scientific papers, such as engineering manuals or regulatory texts.
  • Adding user studies that measure actual comprehension gains from interacting with the generated systems would test whether the quality improvements translate to better understanding.
  • Expanding the benchmark with additional papers and expert systems could reveal whether the observed gains hold across a wider range of technical domains.

Load-bearing premise

The 19 expert-built interactive systems paired with the papers form a reliable ground truth that correctly captures the dynamic mechanisms and state transitions described in each paper.

What would settle it

Independent experts rating PaperVoyager-generated systems as equivalent or inferior to baseline-generated systems on measures of functional correctness and fidelity to the original paper's described behaviors.

Figures

Figures reproduced from arXiv: 2606.00750 by Biao Wu, Dasen Dai, Meng Fang, Shuoqi Li, Wenhao Wang.

Figure 1: Overview of the I-WebGenBench evaluation pipeline. Generated applications are built …
Figure 2: Taxonomy of the 201 paper-derived specifications across five domains. Bars indicate the …
Figure 3: (left) BSR vs IR. Bubble size and color indicate overall score. Most models achieve high …
Figure 4: Visual quality and functional interactivity are orthogonal. (a) Kimi-K2.5 renders a polished …
Figure 5: Visual variety of scientific web applications generated from the same specification (plasma …
Figure 6: System prompt for the PDF-to-Specification stage. This identifies the core scientific …
Figure 7: System prompt prepended to every model call during code generation. The identical prompt …
Figure 8: Repair prompt template. Placeholders {build_error} and {source_code} are filled at runtime. The round counter k is shown for illustration; all five rounds use the same template. [C.4 Block Pipeline Prompts] To handle the complexity of full-scale scientific applications, we introduce a "Block Pipeline" that decomposes the task into manageable UI units. This process uses two primary prompts: a Splitter to brea…
Figure 9: The Splitter prompt used to decompose a monolithic specification into discrete, imple…
Figure 10: The Merger prompt used to recombine individual blocks into a cohesive, bug-free …
Figure 11: The VLM Evaluation prompt template. This multidimensional rubric ensures that …
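
The Figure 8 caption describes a repair step in which one prompt template, with the placeholders {build_error} and {source_code}, is reused for up to five rounds. A minimal sketch of how such a loop could look is given below. Only the two placeholder names and the five-round limit come from the caption; the template wording, the `build` and `ask_model` callables, and the return convention are assumptions made for illustration and are not the authors' implementation.

```python
from typing import Callable, Optional, Tuple

# Hypothetical template text. Only the two placeholder names are taken
# from the Figure 8 caption; the wording around them is an assumption.
REPAIR_TEMPLATE = (
    "The build failed with this error:\n{build_error}\n\n"
    "Current source code:\n{source_code}\n\n"
    "Return the corrected source code."
)

MAX_ROUNDS = 5  # the caption states that all five rounds reuse one template


def repair(
    source_code: str,
    build: Callable[[str], Optional[str]],  # hypothetical: returns an error message, or None on success
    ask_model: Callable[[str], str],        # hypothetical: maps a prompt to revised source code
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[str, bool, int]:
    """Return (final_source, build_succeeded, repair_rounds_used)."""
    for rounds_used in range(max_rounds + 1):
        error = build(source_code)
        if error is None:
            return source_code, True, rounds_used
        if rounds_used == max_rounds:
            break  # repair budget exhausted and the build still fails
        # str.format only interprets braces in the template itself, so
        # braces inside the substituted error or source text are safe.
        prompt = REPAIR_TEMPLATE.format(build_error=error, source_code=source_code)
        source_code = ask_model(prompt)
    return source_code, False, max_rounds
```

In this sketch the build is attempted once before any repair, so a program that already builds uses zero rounds, and a program that never builds triggers exactly `max_rounds` model calls.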
Original abstract

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces I-WebGenBench, a benchmark of 19 research papers paired with expert-built interactive web systems as ground truth, and proposes PaperVoyager, a structured agent framework for end-to-end conversion of PDF papers into executable interactive web applications that model mechanisms and state transitions. It claims that PaperVoyager significantly outperforms prior approaches in generating high-quality interactive systems for dynamic scientific content.

Significance. If the evaluation holds, the work could establish a new paradigm for interactive scientific document understanding by moving beyond static summaries to manipulable web systems that capture dynamic behaviors, with potential applications in education and research dissemination.

major comments (2)
  1. [Benchmark section, likely §4 or equivalent] The 19 expert-built interactive systems are presented as ground truth, but the paper describes no construction protocol, expert selection criteria, inter-rater reliability metrics, coverage of state-transition types, or verification that the systems faithfully reproduce the original papers' dynamic mechanisms. This directly undermines the headline claim of significant quality improvements from PaperVoyager, since every comparison rests on this unvalidated reference.
  2. [Experiments section, likely §5] The abstract asserts that PaperVoyager "significantly improves" the quality of generated systems, but the provided text supplies no quantitative metrics, baselines, statistical tests, error analysis, or ablation results. Without these, the central experimental claim cannot be evaluated for soundness or effect size.
minor comments (2)
  1. The title refers to I-WebGenBench while the abstract emphasizes PaperVoyager; clarify the relationship and ensure consistent naming throughout.
  2. Notation for system components (e.g., modeling of interaction logic) should be defined more explicitly if equations or pseudocode are present in later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these targeted comments on the benchmark and experimental evaluation. We agree that both sections require substantial expansion to make the claims fully evaluable, and we will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Benchmark section, likely §4 or equivalent] The 19 expert-built interactive systems are presented as ground truth, but the paper describes no construction protocol, expert selection criteria, inter-rater reliability metrics, coverage of state-transition types, or verification that the systems faithfully reproduce the original papers' dynamic mechanisms. This directly undermines the headline claim of significant quality improvements from PaperVoyager, since every comparison rests on this unvalidated reference.

    Authors: We agree the current manuscript lacks sufficient detail on ground-truth construction. In the revision we will expand Section 4 with: (i) expert selection criteria (domain researchers with publication records in the target subfield), (ii) a step-by-step construction protocol, (iii) inter-rater reliability scores (Cohen's kappa on a 20% overlap subset), (iv) explicit coverage statistics across state-transition categories, and (v) a verification checklist confirming fidelity to each paper's described mechanisms. These additions will be placed before the main results.
    Revision: yes

  2. Referee: [Experiments section, likely §5] The abstract asserts that PaperVoyager "significantly improves" the quality of generated systems, but the provided text supplies no quantitative metrics, baselines, statistical tests, error analysis, or ablation results. Without these, the central experimental claim cannot be evaluated for soundness or effect size.

    Authors: We acknowledge that the version seen by the referee does not contain the full quantitative results. Section 5 will be expanded to report: concrete metrics (mechanism fidelity, interactivity score, execution success rate), three baselines (direct LLM prompting, a ReAct-style agent, and a non-structured variant), paired statistical tests (Wilcoxon signed-rank with effect sizes), error analysis broken down by failure mode, and ablation results isolating the mechanism-modeling and interaction-logic modules. All numbers and significance statements will be added.
    Revision: yes
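
The first response above names Cohen's kappa as the planned inter-rater reliability statistic. As a reminder of what that statistic measures, here is a minimal sketch that computes it for two raters from the standard definition: observed agreement corrected for the agreement expected by chance. The function name, the "pass"/"fail" labels, and the example ratings are hypothetical; no rating data from the paper is reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item; kappa is
        # formally undefined, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four systems by two experts.
expert_1 = ["pass", "pass", "fail", "fail"]
expert_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(expert_1, expert_2)  # observed 0.75, chance 0.5, kappa 0.5
```

Raw agreement here is 75%, but half of that is expected by chance given each expert's label frequencies, which is why kappa is the more conservative 0.5.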

Circularity Check

0 steps flagged

No circularity; evaluation uses externally constructed expert benchmark

Full rationale

The paper introduces a benchmark consisting of 19 research papers paired with expert-built interactive systems as ground truth and evaluates the PaperVoyager framework against it. No equations, fitted parameters, self-citations, or ansatzes appear in the provided text. The quality improvement claim is measured against this externally described expert construction rather than reducing by definition or construction to the agent's own outputs or prior self-references. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the work rests on the unstated assumption that expert interactive systems faithfully encode paper mechanisms.

pith-pipeline@v0.9.1-grok · 5698 in / 1025 out tokens · 21262 ms · 2026-06-28T18:50:49.122341+00:00 · methodology

