arxiv: 2605.30000 · v2 · pith:JO5N76H6new · submitted 2026-05-28 · 💻 cs.AI

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Pith reviewed 2026-06-29 07:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords: web generation, LLM evaluation, interactive benchmarks, agent-based evaluation, front-end development, reference-free evaluation, continuous screen interaction, metacognitive monitoring

The pith

Cookie-Frame aligns closely with expert human ratings on interactive web generation without references or test suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cookie-Bench, a reference-free benchmark of one thousand web-development queries spanning eleven domains and three difficulty levels, covering both static pages and interactive applications. It introduces Cookie-Frame, a three-stage evaluator that first forms a static impression, then lets an autonomous agent explore the live interface while recording continuous screen video, audio, and per-step screenshots, and finally issues holistic functionality and aesthetics scores with failure attribution. This regime is designed to replicate the reasoned synthesis a human reviewer performs during a live session. On the benchmark, the method aligns closely with expert human ratings and identifies substantial performance gaps among thirteen frontier LLMs. The approach therefore supports scalable, autonomous evaluation of web-generation models without requiring reference implementations.

Core claim

Cookie-Bench supplies an 11-domain, 54-leaf, 1000-query WebDev benchmark balanced across static-presentation and interactive-application tasks; Cookie-Frame implements a metacognition-inspired regime that separates evidence accumulation (static perception plus agent-driven continuous screen interaction) from holistic judgment (dynamic scoring), achieving close alignment with expert human ratings while exposing headroom across frontier LLMs on interactive web generation.

What carries the argument

Cookie-Frame, the three-stage process of static perception, agent-driven interaction with continuous screen-video capture, and post-evidence dynamic scoring with structured failure attribution.
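The separation of evidence accumulation from judgment can be illustrated with a minimal Python sketch. Every name here (EvidenceChain, Verdict, toy_judge, and the stage functions) is hypothetical and chosen only for illustration; the paper's actual implementation is not shown on this page, and the toy judge's 0/1 scores are arbitrary placeholders rather than the paper's scale.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EvidenceChain:
    """Evidence gathered before any judgment (hypothetical structure)."""
    static_impression: Optional[str] = None
    interaction_log: List[str] = field(default_factory=list)
    interaction_done: bool = False

    def is_complete(self) -> bool:
        # Both evidence stages must have run before scoring is allowed.
        return self.static_impression is not None and self.interaction_done


@dataclass
class Verdict:
    functionality: float
    aesthetics: float
    failures: List[str]


def static_perception(chain: EvidenceChain, observe: Callable[[], str]) -> None:
    """Stage 1: record a first impression from passive observation."""
    chain.static_impression = observe()


def agent_interaction(chain: EvidenceChain,
                      steps: List[Callable[[], str]]) -> None:
    """Stage 2: run each exploration step and log what was observed."""
    for step in steps:
        chain.interaction_log.append(step())
    chain.interaction_done = True


def dynamic_scoring(chain: EvidenceChain,
                    judge: Callable[[EvidenceChain], Verdict]) -> Verdict:
    """Stage 3: judge only after the evidence chain is complete."""
    if not chain.is_complete():
        raise RuntimeError("evidence chain incomplete; refusing to score")
    return judge(chain)


def toy_judge(chain: EvidenceChain) -> Verdict:
    """Placeholder judge: flags logged steps that produced no change."""
    failures = [e for e in chain.interaction_log if "no change observed" in e]
    score = 0.0 if failures else 1.0  # arbitrary toy scale, not the paper's
    return Verdict(functionality=score, aesthetics=score, failures=failures)


if __name__ == "__main__":
    chain = EvidenceChain()
    static_perception(chain, lambda: "title screen renders")
    agent_interaction(chain, [lambda: "clicked Start: game began",
                              lambda: "pressed Jump: no change observed"])
    print(dynamic_scoring(chain, toy_judge))
```

The point of the design is the guard in dynamic_scoring: a verdict cannot be produced until both evidence stages have finished, which mirrors the "post-evidence" wording above.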

If this is right

  • Evaluation of LLM-generated interactive web applications can proceed at scale without human judges or reference code at each iteration.
  • Current frontier models exhibit measurable shortfalls on both functionality and aesthetics when judged under continuous-interaction conditions.
  • The same reference-free regime applies equally to static presentation tasks and to dynamic application tasks.
  • Structured failure attribution produced after full evidence collection supplies actionable diagnostic signals for model improvement.
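One way to picture "structured failure attribution" is as a list of typed records that can be tallied per dimension and severity. The sketch below assumes the Functional/Aesthetic dimensions and the Critical/Major/Minor severities named in the Figure 3 caption; the Failure record, its field names, and the summarize helper are hypothetical, and no deduction amounts are modeled because this page does not state them in full.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Dimension(Enum):
    FUNCTIONAL = "functional"
    AESTHETIC = "aesthetic"


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"


@dataclass(frozen=True)
class Failure:
    """A single attributed failure (hypothetical record layout)."""
    dimension: Dimension
    severity: Severity
    description: str


def summarize(failures: List[Failure]) -> Dict[Tuple[str, str], int]:
    """Count failures per (dimension, severity) pair."""
    counts = Counter((f.dimension.value, f.severity.value) for f in failures)
    return dict(counts)


if __name__ == "__main__":
    report = [
        Failure(Dimension.FUNCTIONAL, Severity.CRITICAL, "game crashes on start"),
        Failure(Dimension.FUNCTIONAL, Severity.MINOR, "score label not updated"),
        Failure(Dimension.AESTHETIC, Severity.MINOR, "button text misaligned"),
        Failure(Dimension.FUNCTIONAL, Severity.MINOR, "restart needs two clicks"),
    ]
    for key, n in sorted(summarize(report).items()):
        print(key, n)
```

A tally of this kind is what would make the attribution "actionable": a model developer can see at a glance whether failures cluster in one dimension or one severity band.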

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuous screen-capture record could serve as training data for reward models that learn to predict human preference directly from interaction traces.
  • The separation of evidence accumulation from judgment may generalize to other GUI domains such as mobile or desktop application generation.
  • Because the benchmark resists recall of circulated prompts, repeated use of the same queries is less likely to inflate reported performance over time.

Load-bearing premise

An autonomous agent performing continuous screen interaction and later holistic scoring can replicate the reasoned synthesis a human reviewer performs over a live session without any reference implementation or test suite.

What would settle it

A side-by-side study in which multiple expert human raters independently score the same set of generated web applications and the resulting scores diverge substantially from Cookie-Frame verdicts on a non-negligible fraction of cases.
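The side-by-side study described above reduces to two numbers: how strongly the automated scores track the averaged human scores, and what fraction of cases diverge. The following Python sketch shows one way to compute both. The function names, the choice of Pearson correlation, and the divergence tolerance are assumptions made for illustration; the page does not specify which statistic or threshold the paper uses.

```python
import math
from typing import List, Sequence


def mean_ratings(per_rater: Sequence[Sequence[float]]) -> List[float]:
    """Average several raters' scores item by item."""
    return [sum(item) / len(item) for item in zip(*per_rater)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


def divergent_fraction(human: Sequence[float], auto: Sequence[float],
                       tol: float) -> float:
    """Share of items whose scores differ by more than tol (assumed threshold)."""
    if len(human) != len(auto) or not human:
        raise ValueError("need two non-empty samples of equal length")
    return sum(abs(h - a) > tol for h, a in zip(human, auto)) / len(human)


if __name__ == "__main__":
    raters = [[6.0, 4.0, 7.0, 2.0], [7.0, 4.0, 7.0, 2.0]]  # made-up scores
    human = mean_ratings(raters)
    auto = [6.5, 4.0, 3.0, 2.5]                             # made-up scores
    print(pearson(human, auto), divergent_fraction(human, auto, tol=1.0))
```

A high correlation combined with a non-negligible divergent fraction would be the outcome that undermines the claim, since correlation alone can hide a cluster of badly misjudged cases.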

Figures

Figures reproduced from arXiv: 2605.30000 by Fan Ding, Hangting Lou, Haoqing Yu, Haoyue Yang, Hua Wu, Jing Liu, Jingyao Li, Siqi Bao, Yifeng Kou, Zhangxiao Shen, Zhengfan Wu.

Figure 1: Top: Query “Super Mario” flowing through deployment, autonomous agent-driven …
Figure 2: Cookie-Bench data construction pipeline and dataset statistics. The upper-left shows the data construction pipeline; the lower-right shows the dataset distribution statistics. No single sourcing channel satisfies both requirements alone, so the two regimes are chosen to cover each other’s blind spots. Naturalistic queries, contributing 514 entries drawn from real-user traffic on an internal WebDev product …
Figure 3: Overview of Cookie. A five-stage pipeline from code to score: Install & Start deploys the generated page; Static Evaluation captures logs and a VLM-scored screenshot; Interaction runs the Cookie agent through an Observe-Think-Act loop with human-like clicks; Score Adjustment grades issues at Critical, Major, or Minor severity across Functional and Aesthetic dimensions; Overall Scoring aggregates them into …
Figure 4: Model capability landscape on Cookie-Bench. Left: Per-model Functionality–Aesthetics …
Figure 5: Per-category human agreement rates (%) for ablated evaluation variants on 132 queries.
Figure 6: Rendered screenshot of the generated Super Mario game as seen by the static verifier. The …
Figure 7: Interaction frame captured during the agent’s gap-crossing attempt. Mario is positioned at …
Figure 8: Per-model average scores across language, difficulty tier, and L2 category under React (top …
Original abstract

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. \textbf{\dataname} is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. \textbf{\framename}, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On \dataname, \framename aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Cookie-Bench, an 11-domain, 54-leaf, 1,000-query benchmark for LLM-generated web applications spanning static and interactive tasks across difficulty tiers and languages, with prompts rewritten to avoid recall. It also presents Cookie-Frame, a reference-free three-stage framework (Static Perception, Agent-Driven Interaction via continuous screen capture, and Dynamic Scoring) grounded in Flavell's metacognitive monitoring, claiming close alignment with expert human ratings and substantial headroom across 13 frontier LLMs on interactive web generation.

Significance. If the alignment claim holds with quantitative support, the work could supply a scalable, autonomous alternative to human-judged leaderboards for evaluating complex interactive front-end code, addressing the scalability limits of existing reference- or test-suite-based proxies while enabling holistic, reasoned verdicts.

major comments (2)
  1. [Abstract] Abstract: the central claim that Cookie-Frame 'aligns closely with expert human ratings' is asserted without any reported metrics (e.g., correlation coefficients, agreement statistics, sample sizes, or inter-rater reliability). This is load-bearing for the primary contribution, as the evaluation regime's three-stage separation and mapping from captured evidence to structured verdicts remains unvalidated against human live-session synthesis.
  2. [Abstract] Abstract and evaluation regime description: the assertion of 'substantial headroom across 13 frontier LLMs' inherits the same validation gap; without ablation results comparing staged scoring to direct LLM scoring or details on how the autonomous agent replicates reasoned human judgment, the headroom result cannot be assessed for robustness.
minor comments (2)
  1. The benchmark construction mentions 'briefs rewritten to resist recall from circulated prompts,' but provides no concrete details on the rewriting process or verification method.
  2. Notation for the two artifacts uses placeholder macros (\dataname, \framename) in the abstract; consistent naming (Cookie-Bench, Cookie-Frame) should be used throughout.
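Among the statistics the first major comment asks for, inter-rater reliability is the one most easily left unreported. As an illustration only, the sketch below computes Cohen's kappa for two raters assigning categorical labels to the same items. The function name and the example labels are hypothetical, and nothing on this page indicates which reliability statistic the paper actually reports.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty label sequences of equal length")
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa undefined when both raters use a single label")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    rater_1 = ["pass", "pass", "fail", "fail"]  # made-up labels
    rater_2 = ["pass", "fail", "fail", "fail"]  # made-up labels
    print(cohens_kappa(rater_1, rater_2))
```

Reporting a reliability figure alongside the human-evaluator correlation matters because agreement with a panel means little if the panel members do not agree with each other.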

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need to make quantitative support for the alignment and headroom claims explicit in the abstract. We address each point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Cookie-Frame 'aligns closely with expert human ratings' is asserted without any reported metrics (e.g., correlation coefficients, agreement statistics, sample sizes, or inter-rater reliability). This is load-bearing for the primary contribution, as the evaluation regime's three-stage separation and mapping from captured evidence to structured verdicts remains unvalidated against human live-session synthesis.

    Authors: We agree that the abstract should report the supporting metrics rather than asserting alignment without them. The full manuscript contains a human study with Pearson correlation, sample size, and inter-rater reliability figures validating the three-stage regime against expert ratings on live sessions. We will revise the abstract to include these quantitative results so that the validation of the staged evidence-to-verdict mapping is visible at the abstract level.
    revision: yes

  2. Referee: [Abstract] Abstract and evaluation regime description: the assertion of 'substantial headroom across 13 frontier LLMs' inherits the same validation gap; without ablation results comparing staged scoring to direct LLM scoring or details on how the autonomous agent replicates reasoned human judgment, the headroom result cannot be assessed for robustness.

    Authors: The headroom result is obtained by running the complete Cookie-Frame pipeline (including agent-driven continuous interaction and post-evidence dynamic scoring) on the 13 models; the primary validation remains the correlation with human ratings rather than an internal ablation against direct LLM scoring. We will add a concise clarification in the abstract and evaluation section describing how the metacognition-inspired separation of perception, interaction, and scoring is intended to approximate human live-session synthesis. An explicit ablation against direct LLM scoring is not present in the current manuscript and would require additional experiments; we therefore treat this as a partial revision focused on textual clarification.
    revision: partial

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper presents Cookie-Bench as a new reference-free benchmark and Cookie-Frame as a three-stage evaluation process grounded explicitly in an external concept, Flavell's metacognitive monitoring. No equations, fitted parameters, self-citations, or ansatzes appear in the abstract or the described framework. The alignment claim with human ratings is positioned as an empirical outcome rather than a quantity derived by construction from the inputs. The derivation chain introduces new artifacts without reducing any prediction or uniqueness result to its own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified premise that the described agent interaction plus delayed scoring produces judgments equivalent to human holistic review; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption The three-stage process (Static Perception, Agent-Driven Interaction, Dynamic Scoring) grounded in Flavell's metacognitive monitoring accurately captures human reasoned synthesis over live sessions.
    Invoked when the abstract states the framework separates evidence accumulation from judgment and aligns with expert ratings.

pith-pipeline@v0.9.1-grok · 5811 in / 1286 out tokens · 28784 ms · 2026-06-29T07:21:41.504842+00:00 · methodology

