Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works.

Mark each acceptance item as passed, failed, or unverified.

1 Pith paper cites this work. Polarity classification is still being indexed.


representative citing papers

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications. Its runtime browser evaluation shows that the best agents reach a 76.9% usable rate but only a 20.2% excellent rate.
