WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, with validation against human review.
Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, with validation against human review.