
arxiv: 2605.17637 · v2 · pith:EE2J7NHHnew · submitted 2026-05-17 · 💻 cs.AI

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

Pith reviewed 2026-05-25 05:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords WebGameBench · coding agents · browser games · requirement-to-application · runtime evaluation · game specifications · application delivery

The pith

WebGameBench evaluates coding agents by turning game specifications into browser applications and testing them at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WebGameBench, a benchmark that measures whether coding agents can produce complete, working browser games directly from frozen structured specifications. Each generated artifact is built, served, and exposed as a live web application; a runtime evaluator then plays the game in a real browser and assigns one of three labels: EXCELLENT, USABLE, or UNUSABLE. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. The automatic labels are checked against human gameplay review on a reviewed subset, under the usable-rate criterion only. The gap between the two rates indicates that basic playable delivery remains far from full requirement satisfaction.

Core claim

WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion. It shows that while current coding agents can cross the minimum playable-delivery threshold in many cases, they rarely achieve complete requirement satisfaction as measured by the excellent-rate criterion.

What carries the argument

The runtime evaluator that interacts with the delivered game in a real browser and assigns EXCELLENT, USABLE, or UNUSABLE labels.
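
As a rough illustration of how the two headline rates relate to the three-way label, the sketch below tallies them from a list of labels. It assumes that the usable rate counts every run labeled USABLE or EXCELLENT (that is, at or above the playable threshold) and that the excellent rate counts only EXCELLENT; the paper's exact definitions are not reproduced on this page. The names `Label`, `usable_rate`, and `excellent_rate` are hypothetical, and the data is a toy example, not a result from the paper.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


def usable_rate(labels):
    """Share of runs at or above the playable threshold.

    ASSUMPTION: EXCELLENT runs also count as usable.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return (counts[Label.USABLE] + counts[Label.EXCELLENT]) / len(labels)


def excellent_rate(labels):
    """Share of runs labeled EXCELLENT."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels)[Label.EXCELLENT] / len(labels)


# Toy labels for ten runs (illustrative only).
toy = [Label.EXCELLENT] * 2 + [Label.USABLE] * 5 + [Label.UNUSABLE] * 3

print(f"usable rate:    {usable_rate(toy):.1%}")
print(f"excellent rate: {excellent_rate(toy):.1%}")
```

Under this reading, the excellent rate can never exceed the usable rate, so the distance between the two numbers is the share of runs that are playable but fall short of the full specification.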

If this is right

  • Application-level runtime evaluation separates coding-agent configurations more clearly than source-code or test-only metrics.
  • Browser-native games serve as compact testbeds requiring coordinated input handling, spatial rules, state transitions, and feedback.
  • Crossing the usable threshold does not imply satisfaction of the full specification.
  • Future agent development must address the observed gap between basic playability and excellent requirement adherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same deployment-and-play protocol to non-game web applications could test broader requirement-to-application capabilities.
  • Providing runtime feedback during agent generation might help close the excellent-rate gap.
  • If the evaluator alignment holds only for simple games, more complex specifications may require additional human checks.

Load-bearing premise

The runtime evaluator's three-way labels accurately reflect whether the delivered game satisfies the original specification, with human alignment on a reviewed subset generalizing to the full set.

What would settle it

A full human review of all generated games that produces usable or excellent rates differing substantially from the runtime evaluator's reported figures would show the labels do not generalize.

Figures

Figures reproduced from arXiv: 2605.17637 by Daxiang Dong, Guoliang You, Haoran Wang, Haotian Zhao, Jianmin Wu, Jingnan Gu, Mingyang Dai, Tianlun, Tianshu Zhu, Wenyu Zhang, Xiaoxuan Tang.

Figure 1: Pilot study on browser-native artifacts. We use ‘Opus 4.6’ to compare H5, Tool, Web, ...
Figure 2: Overview of the WebGameBench pipeline. Each task is defined by a frozen Structured ...
Figure 3: WebGameBench corpus profile over 111 browser-native game requirements, including ...
Figure 4: Agreement between the runtime evaluator and human-review labels. The first three ...
Figure 5: Two representative runtime case studies in the main text. Each task compares Kimi K2.5 ...
Figure 6: Supplementary runtime case studies for the four tasks omitted from the main text. Each ...
Original abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WebGameBench, a requirement-to-application benchmark that evaluates coding agents on generating browser-native games from frozen Structured WebGame Specifications. Each artifact is deployed and evaluated in a real browser by a runtime evaluator that assigns EXCELLENT, USABLE, or UNUSABLE labels. On 111 tasks, 12 agents, and 14 configurations, the best result is 76.9% usable rate but only 20.2% excellent rate; the authors interpret the gap as evidence that crossing the playable threshold does not achieve full requirement satisfaction. Human alignment with the runtime labels is reported on a reviewed subset under the Usable-rate criterion.

Significance. If the runtime evaluator's labels, especially EXCELLENT, are shown to be reliable proxies for specification satisfaction, the benchmark would supply a compact, behavior-dense testbed for end-to-end application delivery that is currently missing from coding-agent evaluations. The separation of systems across usable and excellent rates would then constitute a useful, falsifiable signal for tracking progress beyond minimum playable delivery.

major comments (2)
  1. [Abstract / human-alignment description] Abstract and human-alignment paragraph: the central claim that the 20.2% excellent rate demonstrates incomplete requirement satisfaction rests on the EXCELLENT label being a valid proxy for complete specification compliance. The manuscript reports human alignment only “under the Usable-rate criterion” on the reviewed subset and supplies no separate agreement metric or confusion matrix for the EXCELLENT category. This leaves the headline gap interpretation dependent on an untested extrapolation from the usable-rate validation.
  2. [Evaluation protocol] Evaluation protocol section (implied by the three-way labeling description): no details are provided on how the runtime evaluator distinguishes EXCELLENT from USABLE when both pass basic playability, nor on inter-rater reliability or edge-case handling for the EXCELLENT label. Because the 20.2% figure drives the main interpretive claim, the absence of targeted validation for this label is load-bearing.
minor comments (2)
  1. [Abstract] The abstract states alignment “under the Usable-rate criterion” but does not define the exact subset size or selection criteria; adding these numbers would improve reproducibility.
  2. [Results tables] Table or figure reporting the 76.9% / 20.2% figures should include per-agent and per-configuration breakdowns so readers can assess whether the gap is consistent or driven by outliers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional validation and protocol details would strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract / human-alignment description] Abstract and human-alignment paragraph: the central claim that the 20.2% excellent rate demonstrates incomplete requirement satisfaction rests on the EXCELLENT label being a valid proxy for complete specification compliance. The manuscript reports human alignment only “under the Usable-rate criterion” on the reviewed subset and supplies no separate agreement metric or confusion matrix for the EXCELLENT category. This leaves the headline gap interpretation dependent on an untested extrapolation from the usable-rate validation.

    Authors: We agree that the reported human alignment is limited to the Usable-rate criterion and that a dedicated metric for the EXCELLENT category would provide stronger grounding for the interpretive claim. The current validation demonstrates broad consistency between runtime labels and human review for playability, but we acknowledge the extrapolation to full requirement satisfaction. In the revised manuscript we will add a separate agreement metric and confusion matrix specifically for the EXCELLENT label on the reviewed subset. revision: yes

  2. Referee: [Evaluation protocol] Evaluation protocol section (implied by the three-way labeling description): no details are provided on how the runtime evaluator distinguishes EXCELLENT from USABLE when both pass basic playability, nor on inter-rater reliability or edge-case handling for the EXCELLENT label. Because the 20.2% figure drives the main interpretive claim, the absence of targeted validation for this label is load-bearing.

    Authors: The runtime evaluator is an automated browser-based system that applies deterministic behavioral criteria once basic playability is established. We will expand the Evaluation Protocol section to document the precise decision rules separating EXCELLENT from USABLE, including explicit handling of edge cases such as incomplete visual feedback or non-critical rule deviations. Because the primary evaluator is automated rather than human, traditional inter-rater reliability statistics do not apply; the existing human-alignment study on the reviewed subset serves as the external validation. We will clarify this distinction in the revision. revision: yes
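
The agreement analysis discussed in this exchange can be sketched in a few lines of Python. The example builds a three-way confusion matrix between runtime and human labels, then computes observed agreement and Cohen's kappa for a binarized criterion. Cohen's kappa is only one possible agreement metric; the page does not say which metric the authors use or plan to add. The function names and the paired labels are hypothetical toy data, not figures from the paper.

```python
from collections import Counter

LABELS = ("EXCELLENT", "USABLE", "UNUSABLE")


def confusion_matrix(runtime, human):
    """Nested dict: matrix[runtime_label][human_label] -> count."""
    if len(runtime) != len(human):
        raise ValueError("label lists must have equal length")
    counts = Counter(zip(runtime, human))
    return {r: {h: counts[(r, h)] for h in LABELS} for r in LABELS}


def binary_agreement(runtime, human, positive):
    """Observed agreement and Cohen's kappa for 'label in positive' vs. not."""
    if len(runtime) != len(human) or not runtime:
        raise ValueError("need two non-empty label lists of equal length")
    r = [x in positive for x in runtime]
    h = [x in positive for x in human]
    n = len(r)
    observed = sum(a == b for a, b in zip(r, h)) / n
    p_r, p_h = sum(r) / n, sum(h) / n
    expected = p_r * p_h + (1 - p_r) * (1 - p_h)
    # Kappa is undefined when chance agreement is already total.
    kappa = float("nan") if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Toy paired labels for eight reviewed games (illustrative only).
runtime = ["EXCELLENT", "EXCELLENT", "USABLE", "USABLE",
           "USABLE", "UNUSABLE", "UNUSABLE", "USABLE"]
human = ["EXCELLENT", "USABLE", "USABLE", "USABLE",
         "UNUSABLE", "UNUSABLE", "UNUSABLE", "EXCELLENT"]

matrix = confusion_matrix(runtime, human)
usable_obs, usable_kappa = binary_agreement(runtime, human, {"EXCELLENT", "USABLE"})
excellent_obs, excellent_kappa = binary_agreement(runtime, human, {"EXCELLENT"})

print(matrix)
print(f"usable criterion:    agreement={usable_obs:.3f} kappa={usable_kappa:.3f}")
print(f"excellent criterion: agreement={excellent_obs:.3f} kappa={excellent_kappa:.3f}")
```

Running both criteria on the same reviewed subset makes the referee's concern concrete: agreement at the usable threshold does not by itself bound agreement on the EXCELLENT label.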

Circularity Check

0 steps flagged

No circularity; benchmark reports empirical results with external human validation on subset

Full rationale

The manuscript introduces WebGameBench as an empirical benchmark evaluating coding agents on browser game delivery tasks. It reports aggregate rates (76.9% usable, 20.2% excellent) across 111 tasks and multiple agents/configurations, plus a statement that runtime labels align with human review under the Usable-rate criterion on a reviewed subset. No equations, fitted parameters, predictions derived from inputs, self-definitional constructs, or load-bearing self-citations appear anywhere in the text. The central gap claim is an empirical observation, not a derivation that reduces to its own inputs by construction. Human alignment is presented as independent corroboration rather than an internal tautology. This is a standard benchmark paper whose results stand or fall on external reproducibility, not circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5826 in / 1001 out tokens · 15549 ms · 2026-05-25T05:43:23.147057+00:00 · methodology




Appendix excerpts

A Specification Construction Details. This appendix records construction details that are intentionally compressed in the main text. The main text defines the Structured WebGame S...

Extracted fragment: "Pressing the left arrow moves the player left"

Runtime evaluation steps (text preceding D.3):

  1. open the deployed URL with Playwright and handle browser-access issues when needed
  2. perform a short smoke interaction to check loadability, entry, and the main playable state
  3. derive checks from the frozen specification and generic playable-loop requirements
  4. verify checks through user-level actions before relying on source code or logs
  5. record the pre-trigger state, trigger action, visible result, and relevant state or numeric deltas
  6. mark each acceptance item as passed, failed, or unverified

Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works.

D.3 Candidate Preconditions. For long-horizon, rare, or randomized states, code or runtime-state manipulation may be used to construct a candidate precondition. This operation does not itself verif...

Report fields:

  • final conclusion: row or sample id, deployed URL, source reference, quality label, and one-sentence summary
  • core functional checks: pass, fail, or unverified for the main playable loop
  • other issues: non-core failures and whether they affect the final label
  • acceptance results: per-requirement observations and evidence
  • unverified items: reason, attempted interactions, and whether review is needed

E Compute Resources and Trace Statistics. All generation and evaluation jobs are executed through API calls on a CPU cluster. We do not train models or perform GPU fine-tuning. Agent generation uses concurrency 6, and runtime evaluation uses concurrency 20; each concurrent job ...
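
The evaluator steps and report fields listed above suggest a per-requirement evidence record. The Python sketch below models one possible shape for such a record. All class and field names are hypothetical, the URL is a placeholder, and the roll-up in `quality_label` is an illustrative assumption only: the excerpts do not state how passed, failed, and unverified items map to EXCELLENT, USABLE, or UNUSABLE.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNVERIFIED = "unverified"


@dataclass
class AcceptanceResult:
    requirement: str        # acceptance item derived from the frozen specification
    pre_trigger_state: str  # what was observed before acting
    trigger_action: str     # the user-level action performed
    visible_result: str     # what the user could see afterwards
    status: Status
    core: bool = False      # True if the item belongs to the main playable loop


@dataclass
class Report:
    sample_id: str
    deployed_url: str
    results: list = field(default_factory=list)

    def unverified_items(self):
        return [r for r in self.results if r.status is Status.UNVERIFIED]

    def quality_label(self):
        # ASSUMPTION: illustrative roll-up, not the paper's decision rule.
        if not self.results:
            raise ValueError("no acceptance results recorded")
        if any(r.core and r.status is Status.FAILED for r in self.results):
            return "UNUSABLE"
        if all(r.status is Status.PASSED for r in self.results):
            return "EXCELLENT"
        return "USABLE"


report = Report(sample_id="toy-001", deployed_url="http://localhost:8000/")
report.results.append(AcceptanceResult(
    requirement="Pressing the left arrow moves the player left",
    pre_trigger_state="player at x=5",
    trigger_action="press ArrowLeft",
    visible_result="player at x=4",
    status=Status.PASSED,
    core=True,
))
report.results.append(AcceptanceResult(
    requirement="High score persists after restart",
    pre_trigger_state="high score shown as 120",
    trigger_action="click Restart",
    visible_result="not reached during the session",
    status=Status.UNVERIFIED,
))

print(report.quality_label(), len(report.unverified_items()))
```

Keeping the pre-trigger state, action, and visible result on every record reflects the stated principle that a check should be verified through user-level behavior before source code or logs are consulted.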