Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch

WebGen-Bench: Evaluating LLMs on Generating Interactive, Functional Websites from Scratch , author= · 2025 · arXiv 2505.03733

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

cs.MA · 2026-05-06 · conditional · novelty 7.0

SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

cs.SE · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

cs.AI · 2026-05-17

citing papers explorer

Showing 8 of 8 citing papers.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering cs.SE · 2026-05-17 · unverdicted · none · ref 37
SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements cs.SE · 2026-05-17 · unverdicted · none · ref 29
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection cs.LG · 2026-05-08 · unverdicted · none · ref 30
GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies cs.MA · 2026-05-06 · conditional · none · ref 11
SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents cs.SE · 2026-05-17 · unverdicted · none · ref 10 · 2 links
DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning cs.CL · 2026-04-22 · unverdicted · none · ref 28
WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation cs.CV · 2026-04-16 · unverdicted · none · ref 16
MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games cs.AI · 2026-05-17 · unreviewed · ref 13

Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer