WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

Daxiang Dong; Guoliang You; Haoran Wang; Haotian Zhao; Jianmin Wu; Jingnan Gu; Mingyang Dai; Tianlun; Tianshu Zhu; Wenyu Zhang

arxiv: 2605.17637 · v1 · pith:EE2J7NHHnew · submitted 2026-05-17 · 💻 cs.AI


Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu


Pith reviewed 2026-05-20 12:19 UTC · model grok-4.3

classification 💻 cs.AI

keywords coding agents · browser games · requirement evaluation · AI benchmarks · runtime testing · software generation · application delivery


The pith

Coding agents produce usable browser games from requirements but rarely achieve full satisfaction of those requirements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

WebGameBench evaluates coding agents by converting a structured game specification into a running browser application and testing the delivered product directly. A runtime evaluator interacts with the game inside a real browser to assign one of three labels based on how well it meets the original requirements. Experiments across many agents and tasks show that usable outputs are common while excellent ones that fully satisfy the spec remain rare, indicating that getting a game to run is easier than getting it to match every detail.

Core claim

The paper establishes that current coding agents can cross the threshold of delivering a playable browser game yet rarely reach complete requirement satisfaction. Across 111 tasks, 12 agents, and 14 evaluation configurations, the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. On a human-reviewed subset, the runtime labels are broadly aligned with human gameplay review under the Usable-rate criterion, which supports the usable-rate side of the claimed gap between minimum playable delivery and full requirement fulfillment.

What carries the argument

WebGameBench, a requirement-to-application benchmark that deploys generated browser games under a unified protocol and uses a runtime evaluator to interact with the running application and assign EXCELLENT, USABLE, or UNUSABLE labels.
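The three labels and the two headline rates can be sketched as follows. This is a minimal illustration, not the paper's scoring code: it assumes that the usable rate counts every game labeled USABLE or EXCELLENT (that is, every game that clears the minimum playable threshold), which the text here does not spell out. The names `Label` and `rates` and the example labels are hypothetical.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


def rates(labels):
    """Return (usable_rate, excellent_rate) for a list of Label values.

    Assumption: a game counts toward the usable rate when it is labeled
    USABLE or EXCELLENT, i.e. it is at least minimally playable.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    excellent = counts[Label.EXCELLENT]
    usable_or_better = excellent + counts[Label.USABLE]
    return usable_or_better / total, excellent / total


# Made-up labels for ten hypothetical tasks (not data from the paper).
example = [Label.EXCELLENT] * 2 + [Label.USABLE] * 5 + [Label.UNUSABLE] * 3
usable_rate, excellent_rate = rates(example)
print(f"usable={usable_rate:.1%} excellent={excellent_rate:.1%}")
# prints: usable=70.0% excellent=20.0%
```

Under this reading the excellent rate can never exceed the usable rate, so the size of the gap between the two is the quantity of interest.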

Load-bearing premise

The runtime evaluator's interaction with the delivered game in a browser accurately measures whether the application satisfies the original requirements.

What would settle it

A larger human review of generated games that reveals systematic mismatches between the runtime-assigned labels and independent judgments of requirement satisfaction.
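One way such a comparison could be quantified is sketched below. The paper's actual agreement metric is not given here, so this is an assumption-laden illustration: it binarizes both label sets at "USABLE or better" (one plausible reading of the Usable-rate criterion) and reports raw agreement plus Cohen's kappa, a chance-corrected statistic chosen for this example. The function names and the sample labels are hypothetical.

```python
def usable_or_better(label):
    """Binarize a three-way label: True if the game is at least USABLE."""
    return label in ("EXCELLENT", "USABLE")


def agreement(runtime_labels, human_labels):
    """Return (raw_agreement, cohens_kappa) on the binarized labels."""
    if len(runtime_labels) != len(human_labels) or not runtime_labels:
        raise ValueError("label lists must be non-empty and equal length")
    r = [usable_or_better(x) for x in runtime_labels]
    h = [usable_or_better(x) for x in human_labels]
    n = len(r)
    observed = sum(a == b for a, b in zip(r, h)) / n
    # Chance agreement from each rater's marginal rate of "usable or better".
    p_r, p_h = sum(r) / n, sum(h) / n
    expected = p_r * p_h + (1 - p_r) * (1 - p_h)
    # Kappa is undefined when both raters are constant; report 1.0 there,
    # since in that case the two raters agree on every item.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up labels for four hypothetical games (not data from the paper).
runtime = ["EXCELLENT", "USABLE", "UNUSABLE", "USABLE"]
human = ["USABLE", "USABLE", "UNUSABLE", "UNUSABLE"]
print(agreement(runtime, human))  # prints: (0.75, 0.5)
```

Note that this binarization treats EXCELLENT and USABLE as the same class, so it says nothing about whether the EXCELLENT label itself matches human judgment.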

Figures

Figures reproduced from arXiv: 2605.17637 by Daxiang Dong, Guoliang You, Haoran Wang, Haotian Zhao, Jianmin Wu, Jingnan Gu, Mingyang Dai, Tianlun, Tianshu Zhu, Wenyu Zhang, Xiaoxuan Tang.

**Figure 2.** Overview of the WebGameBench pipeline. Each task is defined by a frozen Structured ...

**Figure 3.** WebGameBench corpus profile over 111 browser-native game requirements, including ...

**Figure 4.** Agreement between the runtime evaluator and human-review labels. The first three ...

**Figure 5.** Two representative runtime case studies in the main text. Each task compares Kimi K2.5 ...

**Figure 6.** Supplementary runtime case studies for the four tasks omitted from the main text. Each ...

Original abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces WebGameBench as a requirement-to-application benchmark for coding agents, where agents must generate browser-native games from frozen Structured WebGame Specifications. Using a runtime evaluator that interacts with the game in a real browser, it assigns labels of EXCELLENT, USABLE, or UNUSABLE. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, the best setup achieves a 76.9% usable rate but only a 20.2% excellent rate. The work claims that this demonstrates a gap between minimum playable delivery and complete requirement satisfaction, and that it is the first such benchmark to validate runtime labels against human gameplay review under the Usable-rate criterion on a reviewed subset.

Significance. If the central results hold, WebGameBench offers a valuable new evaluation framework that shifts focus from code or traces to delivered applications in a realistic browser environment. The benchmark's use of behavior-dense games as testbeds and the reported separation of agent performance highlight limitations in current coding agents. The external runtime execution and partial human validation provide a stronger grounding than many existing benchmarks. This could influence how future agent evaluations are designed, particularly for application-building tasks.

major comments (1)

Abstract: The alignment between the runtime evaluator and human review is described only as 'broadly aligned ... under the Usable-rate criterion' on a reviewed subset. No agreement metrics, subset size, or validation details are provided for the EXCELLENT label. Since the headline result contrasts the 76.9% usable rate with the 20.2% excellent rate to argue that playable delivery does not imply full requirement satisfaction, the absence of EXCELLENT-specific validation is a load-bearing concern for the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying this key point about the presentation of our validation results. We respond to the major comment below.


Referee: Abstract: The alignment between the runtime evaluator and human review is described only as 'broadly aligned ... under the Usable-rate criterion' on a reviewed subset. No agreement metrics, subset size, or validation details are provided for the EXCELLENT label. Since the headline result contrasts the 76.9% usable rate with the 20.2% excellent rate to argue that playable delivery does not imply full requirement satisfaction, the absence of EXCELLENT-specific validation is a load-bearing concern for the central claim.

Authors: We agree that the abstract provides only a high-level description and does not report quantitative agreement metrics, the precise subset size, or dedicated validation details for the EXCELLENT label. The human review was designed to validate the runtime evaluator specifically under the Usable-rate criterion, which directly supports the distinction between minimum playable delivery and full requirement satisfaction. We will revise the abstract to state the reviewed subset size and any available agreement statistics for the Usable-rate criterion. For the EXCELLENT label, we will add an explicit note that it is assigned by the evaluator's comprehensive checks against the full Structured WebGame Specification and that separate human validation was not performed for this stricter category. These clarifications will be incorporated in the revised manuscript and expanded in the evaluation section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results grounded in external runtime execution and human review


The paper's central results derive from running 12 coding agents on 111 tasks under 14 configurations, deploying generated games in a browser, and applying a runtime evaluator to produce EXCELLENT/USABLE/UNUSABLE labels. The reported gap (76.9% usable rate vs. 20.2% excellent rate) follows directly from these executions. Human alignment is cited only as external corroboration on a reviewed subset for the Usable-rate criterion, with no equations, parameter fitting, self-definitional loops, or load-bearing self-citations reducing the claims to their inputs by construction. The evaluation chain remains independent of the target metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that browser-native games form a suitable dense testbed for application-level evaluation and that automated browser interaction can proxy requirement satisfaction.

axioms (1)

domain assumption: Browser-native games provide a compact but behavior-dense testbed requiring coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback.
Source: directly stated in the abstract as justification for the choice of testbed.
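To make the "behavior-dense" assumption concrete, here is a toy sketch that is not taken from the paper. It models a one-dimensional game in Python (the benchmark's games run in a browser; Python is used here only for brevity) and shows that even this tiny example needs every behavior in the list. `LaneGame` and its methods are hypothetical names.

```python
class LaneGame:
    """Toy one-dimensional game: reach the rightmost cell to win."""

    def __init__(self, width=5):
        self.width = width
        self.restart()

    def restart(self):
        # Restart behavior: return to the initial position and state.
        self.position = 0
        self.state = "playing"

    def press(self, key):
        # Input handling: keys are ignored once a terminal state is reached.
        if self.state != "playing":
            return self.render()
        # Rule execution: movement is clamped to the board.
        if key == "ArrowLeft":
            self.position = max(0, self.position - 1)
        elif key == "ArrowRight":
            self.position = min(self.width - 1, self.position + 1)
        # State transition and terminal condition.
        if self.position == self.width - 1:
            self.state = "won"
        return self.render()

    def render(self):
        # Spatial mapping and visible feedback: position becomes a cell row.
        cells = ["P" if i == self.position else "." for i in range(self.width)]
        return "".join(cells) + " " + self.state


game = LaneGame(width=3)
print(game.press("ArrowRight"))  # prints: .P. playing
print(game.press("ArrowRight"))  # prints: ..P won
```

A runtime check of this toy would have to exercise each branch through key presses, which is the kind of interaction the benchmark's evaluator is described as performing on real games.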



Pressing the left arrow moves the player left

Tencent. Tencent unveils hy3 preview; model enhances agent capabilities and real-world usability. https://www.tencent.com/en-us/articles/2202320.html, 2026. Accessed: 2026-05-07. A Specification Construction Details This appendix records construction details that are intentionally compressed in the main text. The main text defines the Structured WebGame S...

work page arXiv 2026
[50]

open the deployed URL with Playwright and handle browser-access issues when needed

work page
[51]

perform a short smoke interaction to check loadability, entry, and the main playable state

work page
[52]

derive checks from the frozen specification and generic playable-loop requirements

work page
[53]

verify checks through user-level actions before relying on source code or logs

work page
[54]

record the pre-trigger state, trigger action, visible result, and relevant state or numeric deltas

work page
[55]

Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works

mark each acceptance item as passed, failed, or unverified. Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works. D.3 Candidate Preconditions For long-horizon, rare, or randomized states, code or runtime-state manipulation may be used to construct a candidate precondition. This operation does not itself verif...

work page
[56]

final conclusion: row or sample id, deployed URL, source reference, quality label, and one-sentence summary

work page
[57]

core functional checks: pass, fail, or unverified for the main playable loop

work page
[58]

other issues: non-core failures and whether they affect the final label

work page
[59]

acceptance results: per-requirement observations and evidence

work page
[60]

E Compute Resources and Trace Statistics All generation and evaluation jobs are executed through API calls on a CPU cluster

unverified items: reason, attempted interactions, and whether review is needed. E Compute Resources and Trace Statistics All generation and evaluation jobs are executed through API calls on a CPU cluster. We do not train models or perform GPU fine-tuning. Agent generation uses concurrency 6, and runtime evaluation uses concurrency 20; each concurrent job ...

work page
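The evaluation steps and report fields recovered from the appendix above could be captured in a record like the following. This is a hedged sketch rather than the paper's schema: the class and field names, the sample id, URL, and source path are all hypothetical, and the rule that a passed item must carry runtime evidence is an assumption that reflects the statement that source code alone cannot prove a user-visible behavior works.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNVERIFIED = "unverified"


@dataclass
class Evidence:
    """What the evaluator observed through user-level actions."""
    pre_trigger_state: str
    trigger_action: str
    visible_result: str
    deltas: dict = field(default_factory=dict)  # state or numeric changes


@dataclass
class AcceptanceItem:
    requirement: str
    status: Status
    evidence: Optional[Evidence] = None
    note: str = ""

    def __post_init__(self):
        # Assumed rule: a pass must rest on observed runtime behavior.
        if self.status is Status.PASSED and self.evidence is None:
            raise ValueError("a passed item needs runtime evidence")


@dataclass
class Report:
    sample_id: str
    deployed_url: str
    source_reference: str
    quality_label: str  # EXCELLENT, USABLE, or UNUSABLE
    summary: str
    core_checks: dict = field(default_factory=dict)
    other_issues: list = field(default_factory=list)
    acceptance: list = field(default_factory=list)

    def unverified(self):
        """Items that could not be confirmed and may need review."""
        return [i for i in self.acceptance if i.status is Status.UNVERIFIED]


report = Report(
    sample_id="task-001",                   # hypothetical id
    deployed_url="http://localhost:8000/",  # hypothetical URL
    source_reference="run-001/src",         # hypothetical path
    quality_label="USABLE",
    summary="Core loop works; restart could not be confirmed.",
    core_checks={"main playable loop": Status.PASSED},
    acceptance=[
        AcceptanceItem(
            "Pressing the left arrow moves the player left",
            Status.PASSED,
            Evidence("player at x=3", "press ArrowLeft", "player at x=2", {"x": -1}),
        ),
        AcceptanceItem(
            "Restart resets the score",
            Status.UNVERIFIED,
            note="game-over state not reached in the session",
        ),
    ],
)
print([item.requirement for item in report.unverified()])
# prints: ['Restart resets the score']
```

Keeping `unverified` as a separate status, rather than folding it into `failed`, preserves the distinction between behavior that was observed to be wrong and behavior the evaluator never managed to reach.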

[1] [1]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke 9 Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavari...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InInternational Conference on Learning Representations, 2025

work page 2025

[3] [3]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, 2024

work page 2024

[4] [4]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024

work page 2024

[6] [6]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. InAdvances in Neural Information P...

work page 2024

[7] [7]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

Swt-bench: Testing and validating real-world bug-fixes with code agents.Advances in Neural Information Processing Systems, 2024

Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents.Advances in Neural Information Processing Systems, 2024

work page 2024

[10] [10]

Super: Evaluating agents on setting up and executing tasks from research repositories

Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and Tushar Khot. Super: Evaluating agents on setting up and executing tasks from research repositories. InProceedings of EMNLP, 2024

work page 2024

[11] [11]

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. Paperbench: Evaluating ai’s ability to replicate ai research.arXiv preprint arXiv:2504.01848, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

[13] [13]

Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch

Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch. arXiv preprint arXiv:2505.03733, 2025

[14] [14]

Webcoderbench: Benchmarking web application generation with comprehensive and interpretable evaluation metrics

Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, and Tao Xie. Webcoderbench: Benchmarking web application generation with comprehensive and interpretable evaluation metrics. arXiv preprint arXiv:2601.02430, 2026

[15] [15]

E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, and Yang Deng. E2edev: Benchmarking large language models in end-to-end software development task. arXiv preprint arXiv:2510.14509, 2025

[16] [16]

Design2code: Benchmarking multimodal code generation for automated front-end engineering

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of NAACL, 2025

[17] [17]

Sketch2code: Evaluating vision-language models for interactive web design prototyping

Ryan Li, Yanzhe Zhang, and Diyi Yang. Sketch2code: Evaluating vision-language models for interactive web design prototyping. In Proceedings of NAACL, 2025

[18] [18]

Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zixin Wang, Xinyi Xu, Wenxuan Wang, Zhiyao Xu, Yuhang Wang, and Michael R. Lyu. Interaction2code: Benchmarking mllm-based interactive webpage code generation from interactive prototyping. In Proceedings of ASE, 2025

[19] [19]

Webuibench: A comprehensive benchmark for evaluating multimodal large language models in webui-to-code

Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, and Xuelong Li. Webuibench: A comprehensive benchmark for evaluating multimodal large language models in webui-to-code. In Findings of ACL, 2025

[20] [20]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023

[21] [21]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of ACL, 2024

[22] [22]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of ACL, 2024

[23] [23]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024

[24] [24]

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...

[25] [25]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

[26] [26]

Textworld: A learning environment for text-based games

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Ruo Yu Tao, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. Textworld: A learning environment for text-based games. arXiv preprint arXiv:1806.11532, 2018

[27] [27]

Interactive fiction games: A colossal adventure

Matthew J. Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. In AAAI Conference on Artificial Intelligence,

[28] [28]

URL https://api.semanticscholar.org/CorpusID:202565447

[29] [29]

General video game ai: A multitrack framework for evaluating agents, games, and content generation algorithms

Diego Pérez-Liébana, Jialin Liu, Ahmed Khalifa, Raluca D. Gaina, Julian Togelius, and Simon M. Lucas. General video game ai: A multitrack framework for evaluating agents, games, and content generation algorithms. IEEE Transactions on Games, 2019

[30] [30]

William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The minerl 2019 competition on sample efficient reinforcement learning using human priors. In Advances in Neural Information Processing Systems, 2019

[31] [31]

Gamebench: Evaluating strategic reasoning abilities of llm agents

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents. arXiv preprint arXiv:2406.06613, 2024

[32] [32]

Balrog: Benchmarking agentic llm and vlm reasoning on games

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Balrog: Benchmarking agentic llm and vlm reasoning on games. In International Conference on Learning Representations, 2025

[33] [33]

lmgame-bench: How good are llms at playing games?

Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games? arXiv preprint arXiv:2505.15146, 2025

[34] [34]

Tales: Text adventure learning environment suite

Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, and Marc-Alexandre Côté. Tales: Text adventure learning environment suite. arXiv preprint arXiv:2504.14128, 2025

[35] [35]

Claude opus 4.7

Anthropic. Claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7 ,


[36] [37]

Claude opus 4.6

Anthropic. Claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6 ,


[37] [39]

Claude sonnet 4.5

Anthropic. Claude sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5 ,


[38] [41]

Introducing gpt-5.5

OpenAI. Introducing gpt-5.5. https://openai.com/index/introducing-gpt-5-5/ ,


[39] [42]

Accessed: 2026-05-07


[40] [43]

Gemini 3.1 pro: A smarter model for your most complex tasks

Google Gemini Team. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-05-07

[41] [44]

Deepseek-v4: Towards highly efficient million-token context intelligence

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

[42] [45]

Kimi k2.6: Scaling agentic intelligence

Moonshot AI. Kimi k2.6: Scaling agentic intelligence. https://www.kimi.com/blog/kimi-k2-6, 2026. Accessed: 2026-05-07

[43] [46]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

[44] [47]

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, et al. Glm-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

[45] [48]

Z.AI. Glm-5.1. https://docs.z.ai/guides/llm/glm-5.1, 2026. Accessed: 2026-05-07


[46] [49]

Tencent unveils hy3 preview; model enhances agent capabilities and real-world usability

Tencent. Tencent unveils hy3 preview; model enhances agent capabilities and real-world usability. https://www.tencent.com/en-us/articles/2202320.html, 2026. Accessed: 2026-05-07.

A Specification Construction Details

This appendix records construction details that are intentionally compressed in the main text. The main text defines the Structured WebGame S...

[47] [50]

open the deployed URL with Playwright and handle browser-access issues when needed


[48] [51]

perform a short smoke interaction to check loadability, entry, and the main playable state


[49] [52]

derive checks from the frozen specification and generic playable-loop requirements


[50] [53]

verify checks through user-level actions before relying on source code or logs


[51] [54]

record the pre-trigger state, trigger action, visible result, and relevant state or numeric deltas


[52] [55]

mark each acceptance item as passed, failed, or unverified.

Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works.

D.3 Candidate Preconditions

For long-horizon, rare, or randomized states, code or runtime-state manipulation may be used to construct a candidate precondition. This operation does not itself verif...
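The recording and marking steps listed above (pre-trigger state, trigger action, visible result, state or numeric deltas, then a passed, failed, or unverified status) can be sketched as a small data structure. This is a minimal illustration, not the benchmark's actual schema: the class names, field names, and the `player_x` values are hypothetical, and the browser interaction that would produce the observations is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNVERIFIED = "unverified"


@dataclass
class Evidence:
    """One user-level observation (field names are hypothetical)."""
    pre_trigger_state: dict
    trigger_action: str
    visible_result: str
    post_trigger_state: dict = field(default_factory=dict)

    def numeric_deltas(self) -> dict:
        """Change in every numeric value present both before and after the trigger."""
        deltas = {}
        for key, before in self.pre_trigger_state.items():
            after = self.post_trigger_state.get(key)
            if isinstance(before, (int, float)) and isinstance(after, (int, float)):
                deltas[key] = after - before
        return deltas


@dataclass
class AcceptanceItem:
    requirement: str
    # An item with no runtime evidence stays unverified.
    status: Status = Status.UNVERIFIED
    evidence: list = field(default_factory=list)

    def record(self, evidence: Evidence, expectation_met: bool) -> None:
        """Attach a runtime observation and set the status from it."""
        self.evidence.append(evidence)
        self.status = Status.PASSED if expectation_met else Status.FAILED


# Illustrative usage with made-up coordinates.
item = AcceptanceItem("Pressing the left arrow moves the player left")
ev = Evidence(
    pre_trigger_state={"player_x": 120},
    trigger_action="press ArrowLeft",
    visible_result="player sprite shifts left",
    post_trigger_state={"player_x": 112},
)
item.record(ev, expectation_met=ev.numeric_deltas()["player_x"] < 0)
```

Defaulting to `UNVERIFIED` reflects the rule that source code alone cannot prove a user-visible behavior: the status only changes once an observation from actual interaction is recorded.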


[53] [56]

final conclusion: row or sample id, deployed URL, source reference, quality label, and one-sentence summary


[54] [57]

core functional checks: pass, fail, or unverified for the main playable loop


[55] [58]

other issues: non-core failures and whether they affect the final label


[56] [59]

acceptance results: per-requirement observations and evidence


[57] [60]

unverified items: reason, attempted interactions, and whether review is needed.

E Compute Resources and Trace Statistics

All generation and evaluation jobs are executed through API calls on a CPU cluster. We do not train models or perform GPU fine-tuning. Agent generation uses concurrency 6, and runtime evaluation uses concurrency 20; each concurrent job ...
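The concurrency limits quoted above (6 for agent generation, 20 for runtime evaluation) could be enforced with a semaphore, as in the sketch below. The source does not describe the scheduler actually used, so `run_bounded` and the job callables are assumptions made for illustration; only the two limits come from the text.

```python
import asyncio

GENERATION_CONCURRENCY = 6    # agent generation limit stated in the text
EVALUATION_CONCURRENCY = 20   # runtime evaluation limit stated in the text


async def run_bounded(jobs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight at once.

    Hypothetical helper: results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A caller would pass one coroutine function per API-backed job, for example `asyncio.run(run_bounded(evaluation_jobs, EVALUATION_CONCURRENCY))`, where `evaluation_jobs` is a placeholder list.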
