Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Fan Ding; Hangting Lou; Haoqing Yu; Haoyue Yang; Hua Wu; Jing Liu; Jingyao Li; Siqi Bao; Yifeng Kou; Zhangxiao Shen

arxiv: 2605.30000 · v2 · pith:JO5N76H6new · submitted 2026-05-28 · 💻 cs.AI

Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

Pith reviewed 2026-06-29 07:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords web generation · LLM evaluation · interactive benchmarks · agent-based evaluation · front-end development · reference-free evaluation · continuous screen interaction · metacognitive monitoring


The pith

Cookie-Frame matches expert human ratings on interactive web generation without references or test suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cookie-Bench, a reference-free benchmark of one thousand web-development queries across eleven domains and three difficulty levels, to test both static pages and interactive applications. It introduces Cookie-Frame, a three-stage evaluator that first forms a static impression, then lets an autonomous agent explore the live interface while recording continuous screen video and per-step screenshots, and finally issues holistic functionality and aesthetics scores with failure attribution. This regime is designed to replicate the reasoned synthesis a human reviewer performs during a live session. On the benchmark the method aligns closely with expert human ratings and identifies substantial performance gaps among thirteen frontier LLMs. The approach therefore supports scalable, autonomous evaluation of web-generation models without requiring reference implementations.

Core claim

Cookie-Bench supplies an 11-domain, 54-leaf, 1000-query WebDev benchmark balanced across static-presentation and interactive-application tasks; Cookie-Frame implements a metacognition-inspired regime that separates evidence accumulation (static perception plus agent-driven continuous screen interaction) from holistic judgment (dynamic scoring), achieving close alignment with expert human ratings while exposing headroom across frontier LLMs on interactive web generation.
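The balance property named in the core claim can be checked mechanically. The following is a minimal sketch, assuming each query is stored as a record with hypothetical `domain`, `leaf`, `tier`, and `task_type` fields; the field names, the tier labels, and the 10% tolerance are illustrative assumptions, not the paper's schema or criterion.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    # All field names are hypothetical; the benchmark's real schema is not shown here.
    domain: str      # one of the top-level domains
    leaf: str        # one of the leaf categories
    tier: str        # difficulty tier label (labels below are assumptions)
    task_type: str   # "static" or "interactive"


def balance_report(queries, field_name, tolerance=0.10):
    """Return (counts, is_balanced) for one field of the query records.

    A field counts as balanced when every group's share of the data lies
    within `tolerance` of the uniform share 1/k. The tolerance value is an
    assumption chosen for illustration.
    """
    counts = Counter(getattr(q, field_name) for q in queries)
    if not counts:
        return {}, True
    total = sum(counts.values())
    uniform = 1 / len(counts)
    ok = all(abs(n / total - uniform) <= tolerance for n in counts.values())
    return dict(counts), ok
```

Running `balance_report(queries, "tier")` and `balance_report(queries, "task_type")` over the full query set would show whether the tier and static/interactive splits are as even as described.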

What carries the argument

Cookie-Frame, the three-stage process of static perception, agent-driven interaction with continuous screen-video capture, and post-evidence dynamic scoring with structured failure attribution.
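The separation of evidence accumulation from judgment can be illustrated with a small skeleton. This is a sketch under stated assumptions, not the authors' implementation: the `Evidence` and `Verdict` containers and the three callables (`perceive`, `interact`, `score`) are hypothetical names, and the real stages involve a browser, an agent, and a VLM that are stubbed out here.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Evidence:
    """Everything gathered before any score is issued (hypothetical container)."""
    static_impression: Optional[str] = None
    interaction_log: List[str] = field(default_factory=list)
    complete: bool = False


@dataclass
class Verdict:
    functionality: float
    aesthetics: float
    failures: List[str]


def evaluate(
    app_url: str,
    perceive: Callable[[str], str],
    interact: Callable[[str], Iterable[str]],
    score: Callable[[Evidence], Verdict],
) -> Verdict:
    """Run the three stages in order; scoring sees only a completed evidence chain."""
    evidence = Evidence()
    # Stage 1: static perception, a first impression from passive observation.
    evidence.static_impression = perceive(app_url)
    # Stage 2: agent-driven interaction; the log stands in for video and screenshots.
    evidence.interaction_log = list(interact(app_url))
    evidence.complete = True
    # Stage 3: dynamic scoring, invoked only after evidence collection has finished.
    return score(evidence)
```

The design point is that `score` is never called with partial evidence, which mirrors the description of verdicts being issued only after the evidence chain is complete.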

If this is right

Evaluation of LLM-generated interactive web applications can proceed at scale without human judges or reference code at each iteration.
Current frontier models exhibit measurable shortfalls on both functionality and aesthetics when judged under continuous-interaction conditions.
The same reference-free regime applies equally to static presentation tasks and to dynamic application tasks.
Structured failure attribution produced after full evidence collection supplies actionable diagnostic signals for model improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The continuous screen-capture record could serve as training data for reward models that learn to predict human preference directly from interaction traces.
The separation of evidence accumulation from judgment may generalize to other GUI domains such as mobile or desktop application generation.
Because the benchmark resists recall of circulated prompts, repeated use of the same queries is less likely to inflate reported performance over time.

Load-bearing premise

An autonomous agent performing continuous screen interaction and later holistic scoring can replicate the reasoned synthesis a human reviewer performs over a live session without any reference implementation or test suite.

What would settle it

A side-by-side study in which multiple expert human raters independently score the same set of generated web applications and the resulting scores diverge substantially from Cookie-Frame verdicts on a non-negligible fraction of cases.
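One way such a study could be summarized is the fraction of applications on which the automated verdict departs from the mean expert score by more than a chosen margin. The sketch below assumes scores are keyed by a hypothetical application identifier and uses a threshold of 1.0 score points purely for illustration; neither the data layout nor the threshold comes from the paper.

```python
from statistics import mean


def divergence_rate(human_scores, frame_scores, threshold=1.0):
    """Fraction of shared items where the automated score departs from the
    mean expert score by more than `threshold`.

    human_scores: dict mapping item id -> list of independent expert scores
    frame_scores: dict mapping item id -> automated score
    The threshold is an illustrative assumption.
    """
    shared = sorted(set(human_scores) & set(frame_scores))
    if not shared:
        raise ValueError("no items were scored by both humans and the evaluator")
    divergent = [
        item for item in shared
        if abs(mean(human_scores[item]) - frame_scores[item]) > threshold
    ]
    return len(divergent) / len(shared), divergent
```

What counts as a "non-negligible fraction" would still have to be fixed in advance by whoever runs the study.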

Figures

Figures reproduced from arXiv: 2605.30000 by Fan Ding, Hangting Lou, Haoqing Yu, Haoyue Yang, Hua Wu, Jing Liu, Jingyao Li, Siqi Bao, Yifeng Kou, Zhangxiao Shen, Zhengfan Wu.

**Figure 2.** Cookie-Bench data construction pipeline and dataset statistics. The upper-left shows the data construction pipeline; the lower-right shows the dataset distribution statistics. No single sourcing channel satisfies both requirements alone, so the two regimes are chosen to cover each other’s blind spots. Naturalistic queries, contributing 514 entries drawn from real-user traffic on an internal WebDev product …

**Figure 3.** Overview of Cookie. A five-stage pipeline from code to score: Install & Start deploys the generated page; Static Evaluation captures logs and a VLM-scored screenshot; Interaction runs the Cookie agent through an Observe-Think-Act loop with human-like clicks; Score Adjustment grades issues at Critical, Major, or Minor severity across Functional and Aesthetic dimensions; Overall Scoring aggregates them into …

**Figure 4.** Model capability landscape on Cookie-Bench. Left: Per-model Functionality–Aesthetics …

**Figure 5.** Per-category human agreement rates (%) for ablated evaluation variants on 132 queries.

**Figure 6.** Rendered screenshot of the generated Super Mario game as seen by the static verifier. The …

**Figure 7.** Interaction frame captured during the agent’s gap-crossing attempt. Mario is positioned at …

**Figure 8.** Per-model average scores across language, difficulty tier, and L2 category under React (top …

Original abstract

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. \textbf{\dataname} is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. \textbf{\framename}, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On \dataname, \framename aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cookie-Bench and Cookie-Frame introduce a reference-free three-stage autonomous eval for LLM web apps, but the alignment-with-humans claim has no supporting metrics visible.


The colleague should know two things up front. First, the paper builds Cookie-Bench, a 1,000-query set spanning 11 domains and three difficulty tiers for both static and interactive web tasks, with prompts rewritten to limit leakage. Second, it pairs the benchmark with Cookie-Frame, a three-stage procedure that does static perception, then lets an agent drive continuous screen recording and interaction, and only afterward produces holistic functionality and aesthetics scores with failure attribution.

What is actually new is the explicit separation of evidence gathering from judgment, drawing on Flavell's metacognitive monitoring to justify the staged process. The benchmark construction itself looks careful on paper: balanced across languages and difficulties, and aimed at tasks where reference implementations are impractical. That addresses a real scaling problem with human leaderboards like Arena.

The soft spot is the central claim. The abstract states that Cookie-Frame aligns closely with expert human ratings and shows headroom on 13 LLMs, yet supplies no correlation numbers, agreement statistics, sample sizes, or ablation results comparing the staged method to direct scoring. The stress-test note is right that the load-bearing step—mapping the captured evidence chain to verdicts that match live human synthesis—remains unquantified in the visible description. Without those data, the equivalence cannot be assessed, and the headroom result inherits the same uncertainty.

This paper is for researchers working on automated evaluation of LLM-generated interactive applications. A reader focused on benchmark design or agent-based testing could extract useful ideas from the framework even if the validation is incomplete. It deserves a serious referee because the problem is concrete and the approach is distinct from checklist or reference-based methods, even though the current evidence for the main result is thin. I would send it to peer review so the authors can supply the missing metrics and any internal checks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Cookie-Bench, an 11-domain, 54-leaf, 1,000-query benchmark for LLM-generated web applications spanning static and interactive tasks across difficulty tiers and languages, with prompts rewritten to avoid recall. It also presents Cookie-Frame, a reference-free three-stage framework (Static Perception, Agent-Driven Interaction via continuous screen capture, and Dynamic Scoring) grounded in Flavell's metacognitive monitoring, claiming close alignment with expert human ratings and substantial headroom across 13 frontier LLMs on interactive web generation.

Significance. If the alignment claim holds with quantitative support, the work could supply a scalable, autonomous alternative to human-judged leaderboards for evaluating complex interactive front-end code, addressing the scalability limits of existing reference- or test-suite-based proxies while enabling holistic, reasoned verdicts.

major comments (2)

[Abstract] The central claim that Cookie-Frame 'aligns closely with expert human ratings' is asserted without any reported metrics (e.g., correlation coefficients, agreement statistics, sample sizes, or inter-rater reliability). This is load-bearing for the primary contribution, as the evaluation regime's three-stage separation and its mapping from captured evidence to structured verdicts remain unvalidated against human live-session synthesis.
[Abstract and evaluation regime description] The assertion of 'substantial headroom across 13 frontier LLMs' inherits the same validation gap; without ablation results comparing staged scoring to direct LLM scoring or details on how the autonomous agent replicates reasoned human judgment, the headroom result cannot be assessed for robustness.

minor comments (2)

The benchmark construction mentions 'briefs rewritten to resist recall from circulated prompts,' but provides no concrete details on the rewriting process or verification method.
Notation for the two artifacts uses placeholder macros (\dataname, \framename) in the abstract; consistent naming (Cookie-Bench, Cookie-Frame) should be used throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need to make quantitative support for the alignment and headroom claims explicit in the abstract. We address each point below and will revise the manuscript accordingly.


Referee: [Abstract] The central claim that Cookie-Frame 'aligns closely with expert human ratings' is asserted without any reported metrics (e.g., correlation coefficients, agreement statistics, sample sizes, or inter-rater reliability). This is load-bearing for the primary contribution, as the evaluation regime's three-stage separation and its mapping from captured evidence to structured verdicts remain unvalidated against human live-session synthesis.

Authors: We agree that the abstract should report the supporting metrics rather than asserting alignment without them. The full manuscript contains a human study with Pearson correlation, sample size, and inter-rater reliability figures validating the three-stage regime against expert ratings on live sessions. We will revise the abstract to include these quantitative results so that the validation of the staged evidence-to-verdict mapping is visible at the abstract level.

Revision: yes

Referee: [Abstract and evaluation regime description] The assertion of 'substantial headroom across 13 frontier LLMs' inherits the same validation gap; without ablation results comparing staged scoring to direct LLM scoring or details on how the autonomous agent replicates reasoned human judgment, the headroom result cannot be assessed for robustness.

Authors: The headroom result is obtained by running the complete Cookie-Frame pipeline (including agent-driven continuous interaction and post-evidence dynamic scoring) on the 13 models; the primary validation remains the correlation with human ratings rather than an internal ablation against direct LLM scoring. We will add a concise clarification in the abstract and evaluation section describing how the metacognition-inspired separation of perception, interaction, and scoring is intended to approximate human live-session synthesis. An explicit ablation against direct LLM scoring is not present in the current manuscript and would require additional experiments; we therefore treat this as a partial revision focused on textual clarification.

Revision: partial
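For readers who want to see what the Pearson correlation named in the rebuttal measures, here is a minimal self-contained sketch. It assumes two equal-length lists of per-application scores (expert mean and automated score); the input layout is an assumption, and the paper's actual figures are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near 1 indicates that the automated scores rank and space applications much as the experts do; it does not by itself show agreement on absolute score levels, which is why sample size and inter-rater reliability are requested alongside it.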

Circularity Check

0 steps flagged

No circularity detected in derivation chain


The paper presents Cookie-Bench as a new reference-free benchmark and Cookie-Frame as a three-stage evaluation process grounded explicitly in Flavell's concept of metacognitive monitoring, which originates outside the paper. No equations, fitted parameters, self-citations, or ansatzes appear in the abstract or described framework. The alignment claim with human ratings is positioned as an empirical outcome rather than a quantity derived by construction from the inputs. The derivation chain introduces new artifacts without reducing any prediction or uniqueness result to its own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified premise that the described agent interaction plus delayed scoring produces judgments equivalent to human holistic review; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: the three-stage process (Static Perception, Agent-Driven Interaction, Dynamic Scoring) grounded in Flavell's metacognitive monitoring accurately captures human reasoned synthesis over live sessions.
Invoked when the abstract states that the framework separates evidence accumulation from judgment and aligns with expert ratings.



Reference graph

Works this paper leans on

95 extracted references · 21 canonical work pages · 3 internal anchors

[1] Anthropic. Claude builds visuals. https://claude.com/blog/claude-builds-visuals, 2026. Accessed: 2026-04-23.

[2] Anthropic. Claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. Accessed: 2026-05-01.

[3] Anthropic. Claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026. Accessed: 2026-05-01.

[4] Y. Chen, M. Liu, Y. Shen, Y. Li, T. Huang, X. Fang, T. Zheng, W. Huang, C. Yang, D. Fu, et al. Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video? arXiv preprint arXiv:2509.24709, 2025.

[5] DeepSeek AI. DeepSeek-V4. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026. Accessed: 2026-05-01.

[6] J. H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American psychologist, 34(10):906, 1979.

[7] Google. Gemini 3. https://aistudio.google.com/models/gemini-3, 2026. Accessed: 2026-04-23.

[8] Google DeepMind. Gemini 3.1 Pro. https://deepmind.google/technologies/gemini/, 2026. Accessed: 2026-05-01.

[9] Y. Gui, Z. Li, Y. Wan, Y. Shi, H. Zhang, Y. Su, S. Dong, X. Zhou, and W. Jiang. Vision2ui: A real-world dataset with layout for code generation from ui designs. CoRR, 2024.

[10] Y. Gui, Z. Li, Y. Wan, Y. Shi, H. Zhang, B. Chen, Y. Su, D. Chen, S. Wu, X. Zhou, et al. Webcode2m: A real-world dataset for code generation from webpage designs. In Proceedings of the ACM on Web Conference (WWW 2025), pages 1834–1845, 2025.

[11] H. Guo, W. Zhang, J. Chen, Y. Gu, J. Yang, J. Du, S. Cao, B. Hui, T. Liu, J. Ma, et al. Iw-bench: Evaluating large multimodal models for converting image-to-web. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6449–6466, 2025.

[12] Z. He, W. Hong, Z. Yang, Z. Pan, M. Liu, X. Gu, and J. Tang. Vision2web: A hierarchical benchmark for visual website development with agent verification. arXiv preprint arXiv:2603.26648, 2026.

[13] S. Jung, A. Garcinuno, and S. Mateega. Ui-bench: A benchmark for evaluating design capabilities of ai text-to-app tools. arXiv preprint arXiv:2508.20410, 2025.

[14] F. Kong, J. Zhang, Y. Yue, C. Sun, Y. Tian, S. Feng, X. Yang, D. Wang, Y. Tian, J. Du, et al. Webtest-bench: Evaluating computer-use agents towards end-to-end automated web testing. arXiv preprint arXiv:2603.25226, 2026.

[15] P. Lai, J. Zhuang, K. Zhang, N. Xiong, S. Wang, Y. Xu, C. Chen, Y. Wang, and B. Cui. Webrenderbench: Enhancing web interface generation through layout-style consistency and reinforcement learning. arXiv preprint arXiv:2510.04097, 2025.

[16] H. Laurençon, L. Tronchon, and V. Sanh. Unlocking the conversion of web screenshots into html code with the websight dataset. arXiv preprint arXiv:2403.09029, 2024.

[17] X. Lei, X. Che, J. Xiong, C. Zhang, Y. Huang, C. Zhou, H. Huang, M. Liu, L. Zhu, H. Ye, J. Hao, K. Deng, Z. Zhan, H. Li, D. Li, Y. Yao, M. Sun, Z. Zhang, and J. Liu. Webcompass: Towards multimodal web coding evaluation for code language models, 2026. URL https://arxiv.org/abs/2604.18224

[18] K. Q. Lin, S. Hu, L. Li, Z. Yang, L. Wang, P. Torr, and M. Z. Shou. Computer-use agents as judges for generative user interface. arXiv preprint arXiv:2511.15567, 2025.

[19] C. Liu, Y. Fu, W. Yang, Y. Zhang, and T. Xie. Webcoderbench: Benchmarking web application generation with comprehensive and interpretable evaluation metrics. arXiv preprint arXiv:2601.02430, 2026.

[20] Z. Lu, Y. Yang, H. Ren, H. Hou, H. Xiao, K. Wang, W. Shi, A. Zhou, M. Zhan, and H. Li. Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch. arXiv preprint arXiv:2505.03733, 2025.

[21] Z. Lu, H. Ren, Y. Yang, K. Wang, Z. Zong, M. Zhan, and H. Li. Fullstack-agent: Enhancing agentic full-stack web coding via development-oriented testing and repository back-translation. arXiv preprint arXiv:2602.03798, 2026.

[22] Moonshot AI. Kimi K2.6. https://www.kimi.com/blog/kimi-k2-6, 2026. Accessed: 2026-05-01.

[23] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-01.

[24] OpenAI. Introducing GPT-5.4. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-23.

[25] Z. Peng, W. Tao, X. Yin, C. Ying, Y. Luo, and Y. Guo. Playcoder: Making llm-generated gui code playable. URL https://arxiv.org/abs/2604.19742

[27] S. Robertson. Understanding inverse document frequency: on theoretical arguments for idf. Journal of documentation, 60(5):503–520, 2004.

[28] C. Sadowski and G. Levin. Simhash: Hash-based similarity detection. Technical report, Google, 2007.

[29] C. Si, Y. Zhang, R. Li, Z. Yang, R. Liu, and D. Yang. Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 3956–3974, Albuquerque, New Mexico, Apr. 20...

[30] H. Sun, H. W. Wang, J. Gu, L. Li, and Y. Cheng. Fullfront: Benchmarking mllms across the full front-end engineering workflow. arXiv preprint arXiv:2505.17399, 2025.

[31] Y. Wan, Y. Dong, J. Xiao, Y. Huo, W. Wang, and M. R. Lyu. Mrweb: An exploration of generating multi-page resource-aware web code from ui designs. arXiv preprint arXiv:2412.15310, 2024.

[32] J. Wu, Y.-H. Peng, X. Y. A. Li, A. Swearngin, J. P. Bigham, and J. Nichols. Uiclip: a data-driven model for assessing user interface design. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–16, 2024.

[33] B. Xiao, L. Jiang, S. Huang, T. Lv, Y. Huang, X. Wu, L. Cui, and F. Wei. Code aesthetics with agentic reward feedback. arXiv preprint arXiv:2510.23272, 2025.

[34] J. Xiao, Y. Wan, Y. Huo, Z. Wang, X. Xu, W. Wang, Z. Xu, Y. Wang, and M. R. Lyu. Interaction2code: Benchmarking mllm-based interactive webpage code generation from interactive prototyping. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 241–253. IEEE, 2025.

[35] J. Xiao, M. Wang, M. H. Lam, Y. Wan, J. Liu, Y. Huo, and M. R. Lyu. Designbench: A comprehensive benchmark for mllm-based front-end code generation. arXiv preprint arXiv:2506.06251, 2025.

[36] Xiaomi. Mimo V2 Pro. https://mimo.xiaomi.com/mimo-v2-pro, 2026. Accessed: 2026-05-01.

[37] K. Xu, Y. Mao, X. Guan, and Z. Feng. Web-bench: A llm code benchmark based on web standards and frameworks. arXiv preprint arXiv:2505.07473, 2025.

[38] M. Xu, Z. Yang, W. Hong, L. Pan, X. Fan, Y. Wang, X. Gu, B. Xu, and J. Tang. Webvia: A web-based vision-language agentic framework for interactive and verifiable ui-to-code generation. arXiv preprint arXiv:2511.06251, 2025.

[39] S. Yun, H. Lin, R. Thushara, M. Q. Bhat, Y. Wang, Z. Jiang, M. Deng, J. Wang, T. Tao, J. Li, H. Li, P. Nakov, T. Baldwin, Z. Liu, E. P. Xing, X. Liang, and Z. Shen. Web2code: A large-scale webpage-to-code dataset and evaluation framework for multimodal llms. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 112134–112157. Cu...

[40] C. Zhang, Y. Li, C. Xu, J. Liu, A. Liu, C. Zhou, K. Deng, D. Wu, G. Huang, K. Li, et al. Artifactsbench: Bridging the visual-interactive gap in llm code generation evaluation. arXiv preprint arXiv:2507.04952, 2025.

[41] Z. Zhang, C. Yu, Y. Li, C. Zhuang, L. Mo, and S. Li. Miniappbench: Evaluating the shift from text to interactive html responses in llm-powered assistants. arXiv preprint arXiv:2603.09652, 2026.

[42] Zhipu AI. GLM-5.1. https://z.ai/blog/glm-5.1, 2026. Accessed: 2026-05-01.

[43] H. Zhu, Y. Zhang, B. Zhao, J. Ding, S. Liu, T. Liu, D. Wang, Y. Liu, and Z. Li. Frontendbench: A benchmark for evaluating llms on front-end development via automatic evaluation. arXiv preprint arXiv:2506.13832, 2025.
[44] Environment freeze. During the agent’s deliberation phase (between observation and action selection), the application state is paused to prevent temporal drift. This ensures that the page state the agent reasons about remains consistent with the state it subsequently acts upon, avoiding evaluation artifacts caused by animations, timers, or asynchronous upd...

[45] Multi-modal evidence capture. A continuous capture pipeline records screen video and audio streams alongside per-step screenshots throughout the interaction session. Unlike discrete snapshot-based approaches, this preserves the full temporal evolution of application behavior, including animation timing, transition smoothness, loading-state flicker, and au...

[46] Human-like input simulation. Rather than issuing instantaneous programmatic inputs, the agent introduces realistic interaction rhythms: gradual mouse movements, natural typing cadence, and appropriate pauses between actions. This prevents evaluation artifacts that arise when applications behave differently under programmatic versus human-speed input (e.g.,...

[47] Primacy of Evidence: Document every action and its directly observed result. If no visible change occurred, say so explicitly

[48] Anti-Stagnation: If the page state is identical for 3 consecutive observation cycles while waiting for something (e.g., a response, animation, or load), note this as "no change observed" and move on

[49] Anti-Loop: If you find yourself repeating the same action or sequence without a new outcome, break the loop immediately and document it

[50] Garbled Page Handling: If the page appears garbled or unreadable (encoding issues), refresh up to 2 times. If still broken, exit immediately and report it in your overall_observation -- do not attempt workarounds

[51] Programming Language: The output project is expected to be a web application (HTML, React, etc.), regardless of what the query specifies. Do NOT document or comment on any programming language mismatch -- just test what is in front of you

[52] Page Language: Note in your overall_observation if the page language does not match the language requirement in the user query

[53] No Scoring Required: You are NOT evaluating quality -- just testing functionality and documenting observations.
II. WEB APPLICATION TESTING PROTOCOL
Step 1: Test Core Functionality
Test the CORE LOGIC first. Interact with the main workflow users will use the website, and experience its primary purpose.
Step 2: Test Interactive Elements
Systematically test:

[54] Navigation: All menu items, links, breadcrumbs

[55] Buttons: All primary and secondary buttons

[56] Forms: Input fields, dropdowns, checkboxes, validation

[57] Dynamic Content: Tabs, accordions, modals, tooltips

[58] Media: Images, videos, carousels load status

[59] Search/Filter: Any search or filtering functionality
III. GAME TESTING PROTOCOL
If the application appears to be a game, apply the following protocol instead of (or in addition to) the Web Application Testing Protocol above.
Non-Real-Time Interactive Games (card games, turn-based games, strategy games, puzzle games, etc.):
- Try to test the complete game ...

[60] Step Limit: You MUST conclude and produce your final JSON output within {max_steps} steps. Actively plan your steps, prioritize core functionality, and stop testing in time to prepare output before reaching the limit

[61] Environment Limitations (DO NOT Test -- Assume Working):
- Backend/Database: Login systems, user authentication, data persistence
- Third-Party APIs: LLM APIs, payment gateways, social media APIs, map services
- File Operations: File upload/download functionality
- Email/SMS: Email sending, SMS verification

[62] Console Errors -- Only Report Critical Errors: Record only JavaScript errors that cause observable malfunction in the page or directly break functionality. Do NOT report:
- Console warnings of any kind
- Font loading failures (fonts.googleapis.com, fonts.gstatic.com)
- Favicon 404 errors
- CDN resource failures that do not visibly break the page

[63] Focus on Frontend Interactions: Test what's visibly interactive in the browser

[64] Prioritize Critical Path: Test main workflow first, then secondary elements

[65] No Edge Cases: Do NOT test extreme or edge case inputs.
VII. FINAL OUTPUT FORMAT
Provide a comprehensive interaction summary as a JSON object. This summary will be used by a VLM evaluator to assess the application quality.
Important: You MUST provide this JSON output before reaching the {max_steps}-step limit.
{ "actions_performed": [ "Navigated to {url}"...
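The Anti-Stagnation rule quoted in the agent instructions above (identical page state for 3 consecutive observation cycles) lends itself to a small illustration. The sketch below is not taken from the paper: it assumes the page state can be reduced to bytes (for example screenshot bytes or a serialized DOM) and compares SHA-256 digests; the class name and the hashing approach are assumptions.

```python
import hashlib


class StagnationMonitor:
    """Flags when the same page state has been observed `limit` times in a row."""

    def __init__(self, limit=3):
        # limit=3 follows the quoted rule; everything else here is an assumption.
        self.limit = limit
        self._last_digest = None
        self._streak = 0

    def observe(self, page_state: bytes) -> bool:
        """Record one observation cycle; return True once the state has been
        identical for `limit` consecutive cycles."""
        digest = hashlib.sha256(page_state).hexdigest()
        if digest == self._last_digest:
            self._streak += 1
        else:
            self._last_digest = digest
            self._streak = 1
        return self._streak >= self.limit
```

When `observe` returns `True`, a caller following the quoted instruction would record "no change observed" and move on to the next test step. Exact byte equality is a strict criterion; a real screen capture with animations would likely need a tolerance, which this sketch does not attempt.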
[66] Webpage Screenshot: For visual and layout audit

[67] Source Code: To review logic, event handlers, and implementation quality

[68] Original User Query: To verify if requirements and language match

[69] Browser Console Logs & Dev Server Output: To detect hidden functional crashes or warnings.
II. SCORING SYSTEM (0.0 - 8.0 SCALE)
Programming language specification: The output project MUST be in html/react WHATEVER the query specified. DO NOT DEDUCT points for the difference from user query.
Page language check: The web page's displayed language (in contra...

[70] FUNCTIONAL SCORING (REQUIREMENT-DRIVEN AUDIT)
Step 1: Assess Implementation Completeness
- 8.0 points: ALL user requirements fully implemented with correct logic
- 7.0 points: ALL user requirements implemented, only minor features missing or error (e.g. form validation is incomplete)
- 5.0 points: Core requirements implemented, some key features missing o...

[71] AESTHETIC SCORING (DEFECT-BASED ELITE STANDARDS)
STEP 1: BASELINE
- 5.0 points: High-quality, clean, and modern. Standard professional work.
- 3.0 points: Functional but unpolished. With issues.
- 1.0 points: Raw HTML elements with no styling or only very basic CSS.
- 0.0 points: App fails to render, is blank, shows raw code or placeholder page (e.g. "rea...

[72] Detailed step-by-step explanation

[73] Clear deduction breakdown with math
{ "functional_reason": "Step 1: [Base score, reason]. Step 2: [Instruction following audit - query alignment, elaboration, hallucination, template check, language]. Step 3: [Source code interactive elements verification]. Step 4: [Data display check]. Step 5: [Console error audit]. Calculation: [show math]. Final: X.X",...

[74] Confirm problems found in video - these are DEFINITE issues

[75] Do NOT dismiss code-level problems from static evaluation - unless the video explicitly proves they don't exist

[76] If static evaluation identified a problem in the code, assume it exists UNLESS:
- The video explicitly demonstrates the feature working correctly
- The video shows the problematic code path executing without issues
Your Task
Carefully review the initial screenshot, the source code context, and the interaction video/frames. Identify:

[77] NEW problems discovered in video - issues that became apparent during interaction

[78] CONFIRMED code problems - issues mentioned in static evaluation that are NOT disproven by the video
Focus Areas
Functional Problems
CRITICAL Severity (suggest -2.0 or more each):
- Core logic complete failure: Application crashes, infinite loop, or becomes completely unusable (blocks ALL usage)
- Language mismatch: Page language doesn't match user quer...

[79] Match severity to EXACT deduction amounts: Don't guess - use the mappings above

[80] Report NEW problems from video: Issues that became visible during interaction

[1] Anthropic. Claude builds visuals. https://claude.com/blog/claude-builds-visuals, 2026. Accessed: 2026-04-23.

[2] Anthropic. Claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026. Accessed: 2026-05-01.

[3] Anthropic. Claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026. Accessed: 2026-05-01.

[4] Y. Chen, M. Liu, Y. Shen, Y. Li, T. Huang, X. Fang, T. Zheng, W. Huang, C. Yang, D. Fu, et al. Iwr-bench: Can lvlms reconstruct interactive webpage from a user interaction video? arXiv preprint arXiv:2509.24709, 2025.

[5] DeepSeek AI. DeepSeek-V4. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026. Accessed: 2026-05-01.

[6] J. H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American Psychologist, 34(10):906, 1979.

[7] Google. Gemini 3. https://aistudio.google.com/models/gemini-3, 2026. Accessed: 2026-04-23.

[8] Google DeepMind. Gemini 3.1 Pro. https://deepmind.google/technologies/gemini/, 2026. Accessed: 2026-05-01.

[9] Y. Gui, Z. Li, Y. Wan, Y. Shi, H. Zhang, Y. Su, S. Dong, X. Zhou, and W. Jiang. Vision2ui: A real-world dataset with layout for code generation from ui designs. CoRR, 2024.

[10] Y. Gui, Z. Li, Y. Wan, Y. Shi, H. Zhang, B. Chen, Y. Su, D. Chen, S. Wu, X. Zhou, et al. Webcode2m: A real-world dataset for code generation from webpage designs. In Proceedings of the ACM on Web Conference (WWW 2025), pages 1834–1845, 2025.

[11] H. Guo, W. Zhang, J. Chen, Y. Gu, J. Yang, J. Du, S. Cao, B. Hui, T. Liu, J. Ma, et al. Iw-bench: Evaluating large multimodal models for converting image-to-web. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6449–6466, 2025.

[12] Z. He, W. Hong, Z. Yang, Z. Pan, M. Liu, X. Gu, and J. Tang. Vision2web: A hierarchical benchmark for visual website development with agent verification. arXiv preprint arXiv:2603.26648, 2026.

[13] S. Jung, A. Garcinuno, and S. Mateega. Ui-bench: A benchmark for evaluating design capabilities of ai text-to-app tools. arXiv preprint arXiv:2508.20410, 2025.

[14] F. Kong, J. Zhang, Y. Yue, C. Sun, Y. Tian, S. Feng, X. Yang, D. Wang, Y. Tian, J. Du, et al. Webtest-bench: Evaluating computer-use agents towards end-to-end automated web testing. arXiv preprint arXiv:2603.25226, 2026.

[15] P. Lai, J. Zhuang, K. Zhang, N. Xiong, S. Wang, Y. Xu, C. Chen, Y. Wang, and B. Cui. Webrenderbench: Enhancing web interface generation through layout-style consistency and reinforcement learning. arXiv preprint arXiv:2510.04097, 2025.

[16] H. Laurençon, L. Tronchon, and V. Sanh. Unlocking the conversion of web screenshots into html code with the websight dataset. arXiv preprint arXiv:2403.09029, 2024.

[17] X. Lei, X. Che, J. Xiong, C. Zhang, Y. Huang, C. Zhou, H. Huang, M. Liu, L. Zhu, H. Ye, J. Hao, K. Deng, Z. Zhan, H. Li, D. Li, Y. Yao, M. Sun, Z. Zhang, and J. Liu. Webcompass: Towards multimodal web coding evaluation for code language models, 2026. URL https://arxiv.org/abs/2604.18224

[18] K. Q. Lin, S. Hu, L. Li, Z. Yang, L. Wang, P. Torr, and M. Z. Shou. Computer-use agents as judges for generative user interface. arXiv preprint arXiv:2511.15567, 2025.

[19] C. Liu, Y. Fu, W. Yang, Y. Zhang, and T. Xie. Webcoderbench: Benchmarking web application generation with comprehensive and interpretable evaluation metrics. arXiv preprint arXiv:2601.02430, 2026.

[20] Z. Lu, Y. Yang, H. Ren, H. Hou, H. Xiao, K. Wang, W. Shi, A. Zhou, M. Zhan, and H. Li. Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch. arXiv preprint arXiv:2505.03733, 2025.

[21] Z. Lu, H. Ren, Y. Yang, K. Wang, Z. Zong, M. Zhan, and H. Li. Fullstack-agent: Enhancing agentic full-stack web coding via development-oriented testing and repository back-translation. arXiv preprint arXiv:2602.03798, 2026.

[22] Moonshot AI. Kimi K2.6. https://www.kimi.com/blog/kimi-k2-6, 2026. Accessed: 2026-05-01.

[23] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-01.

[24] OpenAI. Introducing GPT-5.4. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-23.

[25] Z. Peng, W. Tao, X. Yin, C. Ying, Y. Luo, and Y. Guo. Playcoder: Making llm-generated gui code playable. URL https://arxiv.org/abs/2604.19742

[27] S. Robertson. Understanding inverse document frequency: on theoretical arguments for idf. Journal of Documentation, 60(5):503–520, 2004.

[28] C. Sadowski and G. Levin. Simhash: Hash-based similarity detection. Technical report, Google, 2007.

[29] C. Si, Y. Zhang, R. Li, Z. Yang, R. Liu, and D. Yang. Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 3956–3974, Albuquerque, New Mexico, Apr. 20...

[30] H. Sun, H. W. Wang, J. Gu, L. Li, and Y. Cheng. Fullfront: Benchmarking mllms across the full front-end engineering workflow. arXiv preprint arXiv:2505.17399, 2025.

[31] Y. Wan, Y. Dong, J. Xiao, Y. Huo, W. Wang, and M. R. Lyu. Mrweb: An exploration of generating multi-page resource-aware web code from ui designs. arXiv preprint arXiv:2412.15310, 2024.

[32] J. Wu, Y.-H. Peng, X. Y. A. Li, A. Swearngin, J. P. Bigham, and J. Nichols. Uiclip: a data-driven model for assessing user interface design. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–16, 2024.

[33] B. Xiao, L. Jiang, S. Huang, T. Lv, Y. Huang, X. Wu, L. Cui, and F. Wei. Code aesthetics with agentic reward feedback. arXiv preprint arXiv:2510.23272, 2025.

[34] J. Xiao, Y. Wan, Y. Huo, Z. Wang, X. Xu, W. Wang, Z. Xu, Y. Wang, and M. R. Lyu. Interaction2code: Benchmarking mllm-based interactive webpage code generation from interactive prototyping. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 241–253. IEEE, 2025.

[35] J. Xiao, M. Wang, M. H. Lam, Y. Wan, J. Liu, Y. Huo, and M. R. Lyu. Designbench: A comprehensive benchmark for mllm-based front-end code generation. arXiv preprint arXiv:2506.06251, 2025.

[36] Xiaomi. Mimo V2 Pro. https://mimo.xiaomi.com/mimo-v2-pro, 2026. Accessed: 2026-05-01.

[37] K. Xu, Y. Mao, X. Guan, and Z. Feng. Web-bench: A llm code benchmark based on web standards and frameworks. arXiv preprint arXiv:2505.07473, 2025.

[38] M. Xu, Z. Yang, W. Hong, L. Pan, X. Fan, Y. Wang, X. Gu, B. Xu, and J. Tang. Webvia: A web-based vision-language agentic framework for interactive and verifiable ui-to-code generation. arXiv preprint arXiv:2511.06251, 2025.

[39] S. Yun, H. Lin, R. Thushara, M. Q. Bhat, Y. Wang, Z. Jiang, M. Deng, J. Wang, T. Tao, J. Li, H. Li, P. Nakov, T. Baldwin, Z. Liu, E. P. Xing, X. Liang, and Z. Shen. Web2code: A large-scale webpage-to-code dataset and evaluation framework for multimodal llms. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 112134–112157. Cu...

[40] C. Zhang, Y. Li, C. Xu, J. Liu, A. Liu, C. Zhou, K. Deng, D. Wu, G. Huang, K. Li, et al. Artifactsbench: Bridging the visual-interactive gap in llm code generation evaluation. arXiv preprint arXiv:2507.04952, 2025.

[41] Z. Zhang, C. Yu, Y. Li, C. Zhuang, L. Mo, and S. Li. Miniappbench: Evaluating the shift from text to interactive html responses in llm-powered assistants. arXiv preprint arXiv:2603.09652, 2026.

[42] Zhipu AI. GLM-5.1. https://z.ai/blog/glm-5.1, 2026. Accessed: 2026-05-01.

[43] H. Zhu, Y. Zhang, B. Zhao, J. Ding, S. Liu, T. Liu, D. Wang, Y. Liu, and Z. Li. Frontendbench: A benchmark for evaluating llms on front-end development via automatic evaluation. arXiv preprint arXiv:2506.13832, 2025.

Appendix Contents

A. Cookie-Bench Benchmark Data: Supplementary Details ...

[44] Environment freeze. During the agent’s deliberation phase (between observation and action selection), the application state is paused to prevent temporal drift. This ensures that the page state the agent reasons about remains consistent with the state it subsequently acts upon, avoiding evaluation artifacts caused by animations, timers, or asynchronous upd...

[45] Multi-modal evidence capture. A continuous capture pipeline records screen video and audio streams alongside per-step screenshots throughout the interaction session. Unlike discrete snapshot-based approaches, this preserves the full temporal evolution of application behavior, including animation timing, transition smoothness, loading-state flicker, and au...

[46] Human-like input simulation. Rather than issuing instantaneous programmatic inputs, the agent introduces realistic interaction rhythms: gradual mouse movements, natural typing cadence, and appropriate pauses between actions. This prevents evaluation artifacts that arise when applications behave differently under programmatic versus human-speed input (e.g.,...

[47] Primacy of Evidence: Document every action and its directly observed result. If no visible change occurred, say so explicitly

[48] Anti-Stagnation: If the page state is identical for 3 consecutive observation cycles while waiting for something (e.g., a response, animation, or load), note this as "no change observed" and move on

[49] Anti-Loop: If you find yourself repeating the same action or sequence without a new outcome, break the loop immediately and document it

[50] Garbled Page Handling: If the page appears garbled or unreadable (encoding issues), refresh up to 2 times. If still broken, exit immediately and report it in your overall_observation -- do not attempt workarounds

[51] Programming Language: The output project is expected to be a web application (HTML, React, etc.), regardless of what the query specifies. Do NOT document or comment on any programming language mismatch -- just test what is in front of you

[52] Page Language: Note in your overall_observation if the page language does not match the language requirement in the user query

[53] No Scoring Required: You are NOT evaluating quality -- just testing functionality and documenting observations.

II. WEB APPLICATION TESTING PROTOCOL

Step 1: Test Core Functionality
Test the CORE LOGIC first. Interact with the main workflow users will use the website, and experience its primary purpose.

Step 2: Test Interactive Elements
Systematically test:

[54] Navigation: All menu items, links, breadcrumbs

[55] Buttons: All primary and secondary buttons

[56] Forms: Input fields, dropdowns, checkboxes, validation

[57] Dynamic Content: Tabs, accordions, modals, tooltips

[58] Media: Images, videos, carousels load status

[59] Search/Filter: Any search or filtering functionality

III. GAME TESTING PROTOCOL

If the application appears to be a game, apply the following protocol instead of (or in addition to) the Web Application Testing Protocol above.

Non-Real-Time Interactive Games (card games, turn-based games, strategy games, puzzle games, etc.):
- Try to test the complete game ...

[60] Step Limit: You MUST conclude and produce your final JSON output within {max_steps} steps. Actively plan your steps, prioritize core functionality, and stop testing in time to prepare output before reaching the limit

[61] Environment Limitations (DO NOT Test -- Assume Working):
- Backend/Database: Login systems, user authentication, data persistence
- Third-Party APIs: LLM APIs, payment gateways, social media APIs, map services
- File Operations: File upload/download functionality
- Email/SMS: Email sending, SMS verification

[62] Console Errors -- Only Report Critical Errors: Record only JavaScript errors that cause observable malfunction in the page or directly break functionality. Do NOT report:
- Console warnings of any kind
- Font loading failures (fonts.googleapis.com, fonts.gstatic.com)
- Favicon 404 errors
- CDN resource failures that do not visibly break the page

[63] Focus on Frontend Interactions: Test what's visibly interactive in the browser

[64] Prioritize Critical Path: Test main workflow first, then secondary elements

[65] No Edge Cases: Do NOT test extreme or edge case inputs.

VII. FINAL OUTPUT FORMAT

Provide a comprehensive interaction summary as a JSON object. This summary will be used by a VLM evaluator to assess the application quality.

Important: You MUST provide this JSON output before reaching the {max_steps}-step limit.

{ "actions_performed": [ "Navigated to {url}"...
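The Anti-Stagnation rule in the agent prompt above (page state identical for 3 consecutive observation cycles) can be checked mechanically. The following is a minimal sketch and not the authors' implementation: it assumes each observation is available as raw screenshot bytes and treats byte-identical frames as an identical page state. `StagnationGuard` and its `limit` parameter are hypothetical names introduced for illustration.

```python
import hashlib
from collections import deque


class StagnationGuard:
    """Flags when the last `limit` observations are byte-identical (assumed proxy for page state)."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        # Keep only the most recent `limit` digests.
        self._recent = deque(maxlen=limit)

    def observe(self, screenshot: bytes) -> bool:
        """Record one observation; return True if the page looks stagnant."""
        self._recent.append(hashlib.sha256(screenshot).hexdigest())
        window_full = len(self._recent) == self.limit
        return window_full and len(set(self._recent)) == 1


# Usage: three identical frames trigger the flag, a changed frame clears it.
guard = StagnationGuard()
frames = [b"loading", b"loading", b"loading", b"done"]
flags = [guard.observe(frame) for frame in frames]
print(flags)  # [False, False, True, False]
```

When `observe` returns True, an agent following the prompt would note "no change observed" and move on to the next action. A real pipeline would likely need a tolerance for minor pixel differences, which this sketch does not attempt.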

[66] Webpage Screenshot: For visual and layout audit

[67] Source Code: To review logic, event handlers, and implementation quality

[68] Original User Query: To verify if requirements and language match

[69] Browser Console Logs & Dev Server Output: To detect hidden functional crashes or warnings.

II. SCORING SYSTEM (0.0 - 8.0 SCALE)

Programming language specification: The output project MUST be in html/react WHATEVER the query specified. DO NOT DEDUCT points for the difference from user query.

Page language check: The web page's displayed language (in contra...

[70] FUNCTIONAL SCORING (REQUIREMENT-DRIVEN AUDIT)

Step 1: Assess Implementation Completeness
- 8.0 points: ALL user requirements fully implemented with correct logic
- 7.0 points: ALL user requirements implemented, only minor features missing or error (e.g. form validation is incomplete)
- 5.0 points: Core requirements implemented, some key features missing o...

[71] AESTHETIC SCORING (DEFECT-BASED ELITE STANDARDS)

STEP 1: BASELINE
- 5.0 points: High-quality, clean, and modern. Standard professional work.
- 3.0 points: Functional but unpolished. With issues.
- 1.0 points: Raw HTML elements with no styling or only very basic CSS.
- 0.0 points: App fails to render, is blank, shows raw code or placeholder page (e.g. "rea...

[72] Detailed step-by-step explanation

[73] Clear deduction breakdown with math

{ "functional_reason": "Step 1: [Base score, reason]. Step 2: [Instruction following audit - query alignment, elaboration, hallucination, template check, language]. Step 3: [Source code interactive elements verification]. Step 4: [Data display check]. Step 5: [Console error audit]. Calculation: [show math]. Final: X.X",...

[74] Confirm problems found in video - these are DEFINITE issues

[75] Do NOT dismiss code-level problems from static evaluation - unless the video explicitly proves they don't exist

[76] If static evaluation identified a problem in the code, assume it exists UNLESS:
- The video explicitly demonstrates the feature working correctly
- The video shows the problematic code path executing without issues

Your Task

Carefully review the initial screenshot, the source code context, and the interaction video/frames. Identify:

[77] NEW problems discovered in video - issues that became apparent during interaction

[78] CONFIRMED code problems - issues mentioned in static evaluation that are NOT disproven by the video

Focus Areas

Functional Problems

CRITICAL Severity (suggest -2.0 or more each):
- Core logic complete failure: Application crashes, infinite loop, or becomes completely unusable (blocks ALL usage)
- Language mismatch: Page language doesn't match user quer...

[79] Match severity to EXACT deduction amounts: Don't guess - use the mappings above

[80] Report NEW problems from video: Issues that became visible during interaction
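The reconciliation rule stated in the prompt above (a static-evaluation problem stands unless the video disproves it, and problems first seen in the video are reported as new) can be sketched as a small merge step. This is an illustrative assumption about how the two evidence sources could be combined, not the paper's code; `Problem`, `reconcile`, and the string keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    key: str     # hypothetical short identifier for the issue
    source: str  # "static" (confirmed code problem) or "video" (new problem)


def reconcile(static_problems, disproven_by_video, video_problems):
    """Merge static findings with video findings under the prompt's rule."""
    # Static problems survive unless the video explicitly disproves them.
    confirmed = [
        Problem(key, "static")
        for key in static_problems
        if key not in disproven_by_video
    ]
    # Video problems not already known from static evaluation are new.
    new = [
        Problem(key, "video")
        for key in video_problems
        if key not in static_problems
    ]
    return confirmed + new


# Usage with made-up issue names.
result = reconcile(
    static_problems=["submit handler missing", "score not updated"],
    disproven_by_video={"score not updated"},
    video_problems=["modal cannot be closed"],
)
print([(p.key, p.source) for p in result])
```

In this example the "score not updated" finding is dropped because the video is assumed to show the feature working, while the other static finding is kept and the modal issue is added as a new problem.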