WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
Pith reviewed 2026-05-25 05:43 UTC · model grok-4.3
The pith
WebGameBench evaluates coding agents by turning game specifications into browser applications and testing them at runtime.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion. It shows that while current coding agents can cross the minimum playable-delivery threshold in many cases, they rarely achieve complete requirement satisfaction as measured by the excellent-rate criterion.
What carries the argument
The runtime evaluator that interacts with the delivered game in a real browser and assigns EXCELLENT, USABLE, or UNUSABLE labels.
If this is right
- Application-level runtime evaluation separates coding-agent configurations more clearly than source-code or test-only metrics.
- Browser-native games serve as compact testbeds requiring coordinated input handling, spatial rules, state transitions, and feedback.
- Crossing the usable threshold does not imply satisfaction of the full specification.
- Future agent development must address the observed gap between basic playability and excellent requirement adherence.
Where Pith is reading between the lines
- Extending the same deployment-and-play protocol to non-game web applications could test broader requirement-to-application capabilities.
- Providing runtime feedback during agent generation might help close the excellent-rate gap.
- If the evaluator alignment holds only for simple games, more complex specifications may require additional human checks.
Load-bearing premise
The runtime evaluator's three-way labels accurately reflect whether the delivered game satisfies the original specification, with human alignment on a reviewed subset generalizing to the full set.
What would settle it
A full human review of all generated games that produces usable or excellent rates differing substantially from the runtime evaluator's reported figures would show the labels do not generalize.
Original abstract
Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.
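The two headline figures in the abstract are rates over three-way labels. A minimal sketch of that arithmetic follows. It assumes that an EXCELLENT run also counts toward the usable rate, which the abstract implies but does not state; the `Label` enum, the `rates` helper, and the example labels are hypothetical and are not data from the paper.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


def rates(labels):
    """Return (usable_rate, excellent_rate) as fractions of all labeled runs.

    Assumption: EXCELLENT runs also cross the usable threshold, so they are
    counted in the usable rate as well.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    excellent = counts[Label.EXCELLENT]
    usable_or_better = excellent + counts[Label.USABLE]
    return usable_or_better / total, excellent / total


# Hypothetical labels for ten runs of one configuration (not from the paper).
example = [Label.EXCELLENT] * 2 + [Label.USABLE] * 5 + [Label.UNUSABLE] * 3
usable_rate, excellent_rate = rates(example)
print(f"usable={usable_rate:.1%} excellent={excellent_rate:.1%}")
```

Under this reading the excellent rate can never exceed the usable rate, so the gap between them is the share of runs that are playable but fall short of the full specification.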
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebGameBench, a requirement-to-application benchmark that evaluates coding agents on generating browser-native games from frozen Structured WebGame Specifications. Each artifact is deployed and evaluated in a real browser by a runtime evaluator that assigns EXCELLENT, USABLE, or UNUSABLE labels. On 111 tasks, 12 agents, and 14 configurations, the best result is 76.9% usable rate but only 20.2% excellent rate; the authors interpret the gap as evidence that crossing the playable threshold does not achieve full requirement satisfaction. Human alignment with the runtime labels is reported on a reviewed subset under the Usable-rate criterion.
Significance. If the runtime evaluator's labels, especially EXCELLENT, are shown to be reliable proxies for specification satisfaction, the benchmark would supply a compact, behavior-dense testbed for end-to-end application delivery that is currently missing from coding-agent evaluations. The separation of systems across usable and excellent rates would then constitute a useful, falsifiable signal for tracking progress beyond minimum playable delivery.
Major comments (2)
- [Abstract / human-alignment description] Abstract and human-alignment paragraph: the central claim that the 20.2% excellent rate demonstrates incomplete requirement satisfaction rests on the EXCELLENT label being a valid proxy for complete specification compliance. The manuscript reports human alignment only “under the Usable-rate criterion” on the reviewed subset and supplies no separate agreement metric or confusion matrix for the EXCELLENT category. This leaves the headline gap interpretation dependent on an untested extrapolation from the usable-rate validation.
- [Evaluation protocol] Evaluation protocol section (implied by the three-way labeling description): no details are provided on how the runtime evaluator distinguishes EXCELLENT from USABLE when both pass basic playability, nor on inter-rater reliability or edge-case handling for the EXCELLENT label. Because the 20.2% figure drives the main interpretive claim, the absence of targeted validation for this label is load-bearing.
Minor comments (2)
- [Abstract] The abstract states alignment “under the Usable-rate criterion” but does not give the size of the reviewed subset or how it was selected; adding both would improve reproducibility.
- [Results tables] The table or figure reporting the 76.9% / 20.2% figures should include per-agent and per-configuration breakdowns so readers can assess whether the gap is consistent or driven by outliers.
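The breakdown requested in the last minor comment could be produced along these lines. This is only a sketch: the row layout `(agent, configuration, label)`, the `breakdown` helper, and the sample rows are hypothetical, and EXCELLENT is again assumed to count toward the usable rate.

```python
from collections import defaultdict


def breakdown(rows):
    """Map (agent, configuration) to (usable_rate, excellent_rate, gap).

    rows: iterable of (agent, configuration, label) tuples, where label is
    one of "EXCELLENT", "USABLE", "UNUSABLE".
    """
    cells = defaultdict(lambda: {"n": 0, "usable": 0, "excellent": 0})
    for agent, config, label in rows:
        cell = cells[(agent, config)]
        cell["n"] += 1
        if label in ("USABLE", "EXCELLENT"):
            cell["usable"] += 1
        if label == "EXCELLENT":
            cell["excellent"] += 1
    table = {}
    for key, cell in cells.items():
        usable = cell["usable"] / cell["n"]
        excellent = cell["excellent"] / cell["n"]
        table[key] = (usable, excellent, usable - excellent)
    return table


# Hypothetical rows, not results from the paper.
sample_rows = [
    ("agent-a", "cfg-1", "EXCELLENT"),
    ("agent-a", "cfg-1", "USABLE"),
    ("agent-a", "cfg-1", "UNUSABLE"),
    ("agent-a", "cfg-1", "USABLE"),
    ("agent-b", "cfg-1", "UNUSABLE"),
    ("agent-b", "cfg-1", "USABLE"),
]
for key, (usable, excellent, gap) in sorted(breakdown(sample_rows).items()):
    print(key, f"usable={usable:.2f} excellent={excellent:.2f} gap={gap:.2f}")
```

Reporting the gap per cell, rather than only for the best configuration, is what would let a reader see whether it is uniform or concentrated in a few agents.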
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas where additional validation and protocol details would strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.
- Referee: [Abstract / human-alignment description] Abstract and human-alignment paragraph: the central claim that the 20.2% excellent rate demonstrates incomplete requirement satisfaction rests on the EXCELLENT label being a valid proxy for complete specification compliance. The manuscript reports human alignment only “under the Usable-rate criterion” on the reviewed subset and supplies no separate agreement metric or confusion matrix for the EXCELLENT category. This leaves the headline gap interpretation dependent on an untested extrapolation from the usable-rate validation.
  Authors: We agree that the reported human alignment is limited to the Usable-rate criterion and that a dedicated metric for the EXCELLENT category would provide stronger grounding for the interpretive claim. The current validation demonstrates broad consistency between runtime labels and human review for playability, but we acknowledge that extending it to full requirement satisfaction is an extrapolation. In the revised manuscript we will add a separate agreement metric and confusion matrix specifically for the EXCELLENT label on the reviewed subset.
  Revision: yes
- Referee: [Evaluation protocol] Evaluation protocol section (implied by the three-way labeling description): no details are provided on how the runtime evaluator distinguishes EXCELLENT from USABLE when both pass basic playability, nor on inter-rater reliability or edge-case handling for the EXCELLENT label. Because the 20.2% figure drives the main interpretive claim, the absence of targeted validation for this label is load-bearing.
  Authors: The runtime evaluator is an automated browser-based system that applies deterministic behavioral criteria once basic playability is established. We will expand the Evaluation Protocol section to document the precise decision rules separating EXCELLENT from USABLE, including explicit handling of edge cases such as incomplete visual feedback or non-critical rule deviations. Because the primary evaluator is automated rather than human, traditional inter-rater reliability statistics do not apply; the existing human-alignment study on the reviewed subset serves as the external validation. We will clarify this distinction in the revision.
  Revision: yes
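The agreement metric and confusion matrix for the EXCELLENT label that the rebuttal promises could take roughly the following shape. This is a sketch under stated assumptions: labels are binarized into EXCELLENT versus not EXCELLENT, and Cohen's kappa is used as the agreement statistic. The rebuttal does not say which metric the authors will report, and the helper names and example labels are hypothetical.

```python
def excellent_confusion(runtime, human):
    """Return (tp, fp, fn, tn) for the binarized question 'is it EXCELLENT?'.

    The runtime evaluator is treated as the prediction and the human review
    as the reference.
    """
    if len(runtime) != len(human):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = tn = 0
    for r, h in zip(runtime, human):
        r_pos, h_pos = r == "EXCELLENT", h == "EXCELLENT"
        if r_pos and h_pos:
            tp += 1
        elif r_pos:
            fp += 1
        elif h_pos:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table: chance-corrected agreement."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("empty confusion matrix")
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used a single category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six reviewed games (not data from the paper).
runtime_labels = ["EXCELLENT", "EXCELLENT", "USABLE", "USABLE", "UNUSABLE", "EXCELLENT"]
human_labels = ["EXCELLENT", "USABLE", "USABLE", "USABLE", "UNUSABLE", "USABLE"]
matrix = excellent_confusion(runtime_labels, human_labels)
print(matrix, round(cohen_kappa(*matrix), 3))
```

In this made-up example the two label sets agree perfectly on playability (usable or better versus UNUSABLE) while agreeing only weakly on EXCELLENT, which is the situation the first major comment is concerned about.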
Circularity Check
No circularity: the benchmark reports empirical results with external human validation on a reviewed subset.
The manuscript introduces WebGameBench as an empirical benchmark evaluating coding agents on browser game delivery tasks. It reports aggregate rates (76.9% usable, 20.2% excellent) across 111 tasks and multiple agents/configurations, plus a statement that runtime labels align with human review under the Usable-rate criterion on a reviewed subset. No equations, fitted parameters, predictions derived from inputs, self-definitional constructs, or load-bearing self-citations appear anywhere in the text. The central gap claim is an empirical observation, not a derivation that reduces to its own inputs by construction. Human alignment is presented as independent corroboration rather than an internal tautology. This is a standard benchmark paper whose results stand or fall on external reproducibility, not circular reduction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Specification Construction Details
This appendix records construction details that are intentionally compressed in the main text. The main text defines the Structured WebGame S...
- open the deployed URL with Playwright and handle browser-access issues when needed
- perform a short smoke interaction to check loadability, entry, and the main playable state
- derive checks from the frozen specification and generic playable-loop requirements
- verify checks through user-level actions before relying on source code or logs
- record the pre-trigger state, trigger action, visible result, and relevant state or numeric deltas
- mark each acceptance item as passed, failed, or unverified. Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works.
D.3 Candidate Preconditions
For long-horizon, rare, or randomized states, code or runtime-state manipulation may be used to construct a candidate precondition. This operation does not itself verif...
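The bookkeeping in the evaluation steps above (record the pre-trigger state, trigger action, visible result, and deltas, then mark each acceptance item as passed, failed, or unverified) could be represented as follows. This is a sketch only: the class names, field names, `mark` and `summarize` helpers, and the example items are hypothetical and are not taken from the benchmark's code. It encodes one rule from the text, namely that an item with no user-level observation stays unverified rather than being passed on the strength of source code alone.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNVERIFIED = "unverified"


@dataclass
class Evidence:
    """One user-level observation, following the fields named in the text."""
    pre_trigger_state: str
    trigger_action: str
    visible_result: str
    deltas: Dict[str, float] = field(default_factory=dict)


@dataclass
class AcceptanceItem:
    requirement: str
    status: Status = Status.UNVERIFIED
    evidence: Optional[Evidence] = None


def mark(item: AcceptanceItem, evidence: Optional[Evidence],
         behaved_as_specified: bool) -> AcceptanceItem:
    """Set the status from an observation; no observation means unverified."""
    item.evidence = evidence
    if evidence is None:
        item.status = Status.UNVERIFIED
    elif behaved_as_specified:
        item.status = Status.PASSED
    else:
        item.status = Status.FAILED
    return item


def summarize(items: List[AcceptanceItem]) -> Dict[str, int]:
    """Count items per status, always listing all three statuses."""
    counts = Counter(item.status for item in items)
    return {status.value: counts[status] for status in Status}


# Hypothetical acceptance items for illustration only.
items = [
    mark(AcceptanceItem("left arrow moves the player left"),
         Evidence("player at x=5", "press ArrowLeft", "player at x=4", {"x": -1}),
         True),
    mark(AcceptanceItem("restart resets the score"),
         Evidence("score=30", "click Restart", "score still 30", {"score": 0}),
         False),
    mark(AcceptanceItem("win screen appears after the final level"), None, False),
]
print(summarize(items))
```

How these per-item statuses are combined into the final EXCELLENT, USABLE, or UNUSABLE label is not specified in the text shown here, so the sketch deliberately stops at the per-item summary.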
- final conclusion: row or sample id, deployed URL, source reference, quality label, and one-sentence summary
- core functional checks: pass, fail, or unverified for the main playable loop
- other issues: non-core failures and whether they affect the final label
- acceptance results: per-requirement observations and evidence
- unverified items: reason, attempted interactions, and whether review is needed.
E Compute Resources and Trace Statistics
All generation and evaluation jobs are executed through API calls on a CPU cluster. We do not train models or perform GPU fine-tuning. Agent generation uses concurrency 6, and runtime evaluation uses concurrency 20; each concurrent job ...