WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

2025 · arXiv 2505.03733

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · baseline 1 · dataset 1

citation-polarity summary

background 1 · baseline 1 · use dataset 1

representative citing papers

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

Cookie-Bench is a reference-free 1,000-query web development benchmark paired with Cookie-Frame, a metacognition-inspired three-stage framework (static perception, agent interaction, dynamic scoring) that aligns with human ratings on 13 frontier LLMs.

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications; runtime browser evaluation shows that the best agents reach a 76.9% usable rate but only a 20.2% excellent rate.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

cs.MA · 2026-05-06 · conditional · novelty 7.0

SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

cs.SE · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications

cs.CL · 2026-05-30 · unverdicted · novelty 5.0

A paper-to-interactive-system agent and the I-WebGenBench benchmark, with 19 papers, enable converting scientific PDFs into executable interactive web systems; the PaperVoyager framework is shown to improve quality.
