WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

Daxiang Dong; Guoliang You; Haoran Wang; Haotian Zhao; Jianmin Wu; Jingnan Gu; Mingyang Dai; Tianlun; Tianshu Zhu; Wenyu Zhang

arxiv: 2605.17637 · v2 · pith:EE2J7NHHnew · submitted 2026-05-17 · 💻 cs.AI

Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu

Pith reviewed 2026-05-25 05:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords WebGameBench · coding agents · browser games · requirement-to-application · runtime evaluation · game specifications · application delivery

The pith

WebGameBench evaluates coding agents by turning game specifications into browser applications and testing them at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WebGameBench, a benchmark that measures whether coding agents can produce complete, working browser games directly from frozen structured specifications. Each generated artifact is built, served, and exposed as a live web application; a runtime evaluator then plays the game in a real browser and assigns one of three labels: EXCELLENT, USABLE, or UNUSABLE. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. The automatic labels are checked against human gameplay review on a reviewed subset, under the Usable-rate criterion only. The gap between the two rates indicates that basic playable delivery is still far from full requirement satisfaction.

Core claim

WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion. It shows that while current coding agents can cross the minimum playable-delivery threshold in many cases, they rarely achieve complete requirement satisfaction as measured by the excellent-rate criterion.

What carries the argument

The runtime evaluator that interacts with the delivered game in a real browser and assigns EXCELLENT, USABLE, or UNUSABLE labels.
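
The headline usable and excellent rates are aggregates over these three-way labels. A minimal sketch of that aggregation follows, assuming that an EXCELLENT game also counts toward the usable rate; the text here does not state this explicitly. The function name `label_rates` and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter

LABELS = ("EXCELLENT", "USABLE", "UNUSABLE")


def label_rates(labels):
    """Return (usable_rate, excellent_rate) for a list of three-way labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    # Assumption: EXCELLENT also crosses the usable threshold.
    usable_rate = (counts["EXCELLENT"] + counts["USABLE"]) / total
    excellent_rate = counts["EXCELLENT"] / total
    return usable_rate, excellent_rate


# Toy labels for illustration only; these are not results from the paper.
toy = ["EXCELLENT"] * 2 + ["USABLE"] * 5 + ["UNUSABLE"] * 3
usable, excellent = label_rates(toy)
print(f"usable={usable:.1%} excellent={excellent:.1%}")
```

Under this reading the excellent rate can never exceed the usable rate, so the gap between the two is always non-negative.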

If this is right

Application-level runtime evaluation separates coding-agent configurations more clearly than source-code or test-only metrics.
Browser-native games serve as compact testbeds requiring coordinated input handling, spatial rules, state transitions, and feedback.
Crossing the usable threshold does not imply satisfaction of the full specification.
Future agent development must address the observed gap between basic playability and excellent requirement adherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the same deployment-and-play protocol to non-game web applications could test broader requirement-to-application capabilities.
Providing runtime feedback during agent generation might help close the excellent-rate gap.
If the evaluator alignment holds only for simple games, more complex specifications may require additional human checks.

Load-bearing premise

The runtime evaluator's three-way labels accurately reflect whether the delivered game satisfies the original specification, with human alignment on a reviewed subset generalizing to the full set.

What would settle it

A full human review of all generated games that produces usable or excellent rates differing substantially from the runtime evaluator's reported figures would show the labels do not generalize.

Figures

Figures reproduced from arXiv: 2605.17637 by Daxiang Dong, Guoliang You, Haoran Wang, Haotian Zhao, Jianmin Wu, Jingnan Gu, Mingyang Dai, Tianlun, Tianshu Zhu, Wenyu Zhang, Xiaoxuan Tang.

**Figure 2.** Overview of the WebGameBench pipeline. Each task is defined by a frozen Structured ...

**Figure 3.** WebGameBench corpus profile over 111 browser-native game requirements, including ...

**Figure 4.** Agreement between the runtime evaluator and human-review labels. The first three ...

**Figure 5.** Two representative runtime case studies in the main text. Each task compares Kimi K2.5 ...

**Figure 6.** Supplementary runtime case studies for the four tasks omitted from the main text. Each ...

Original abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WebGameBench tests coding agents on delivering actual browser games from specs via runtime browser evaluation, but the human validation only covers usable rate and leaves the excellent-rate gap claim on shaky ground.

The main takeaway is a benchmark that shifts evaluation from code or traces to whether the agent produces a working browser game. They convert specs into deployable web apps, run them in a real browser, and have an evaluator assign excellent, usable, or unusable based on interaction. Across 111 tasks and 12 agents, the top setup reaches 76.9% usable but only 20.2% excellent, which they read as evidence that playable delivery does not equal full spec compliance. They also check runtime labels against human review on a subset and report broad alignment under the usable-rate criterion.

That setup is new for this subfield and the browser-game domain is a sensible compact testbed for input handling, state, and feedback. The attempt to ground the labels in human gameplay is a positive step.

The soft spot is the validation itself. Alignment is only reported for usable-rate; no separate human agreement figure is given for the excellent label. The headline result turns on the 20.2% excellent rate being a trustworthy signal of complete requirement satisfaction, so the missing targeted check for that label weakens the gap interpretation. Task construction and evaluator details are also light in the provided text.

This is for researchers building or benchmarking coding agents who want application-level tests. It is worth a serious referee to verify the methodology and request the missing excellent-label validation numbers, even if the current evidence for the central claim is limited.

Referee Report

2 major / 2 minor

Summary. The paper introduces WebGameBench, a requirement-to-application benchmark that evaluates coding agents on generating browser-native games from frozen Structured WebGame Specifications. Each artifact is deployed and evaluated in a real browser by a runtime evaluator that assigns EXCELLENT, USABLE, or UNUSABLE labels. On 111 tasks, 12 agents, and 14 configurations, the best result is 76.9% usable rate but only 20.2% excellent rate; the authors interpret the gap as evidence that crossing the playable threshold does not achieve full requirement satisfaction. Human alignment with the runtime labels is reported on a reviewed subset under the Usable-rate criterion.

Significance. If the runtime evaluator's labels, especially EXCELLENT, are shown to be reliable proxies for specification satisfaction, the benchmark would supply a compact, behavior-dense testbed for end-to-end application delivery that is currently missing from coding-agent evaluations. The separation of systems across usable and excellent rates would then constitute a useful, falsifiable signal for tracking progress beyond minimum playable delivery.

major comments (2)

[Abstract / human-alignment description] Abstract and human-alignment paragraph: the central claim that the 20.2% excellent rate demonstrates incomplete requirement satisfaction rests on the EXCELLENT label being a valid proxy for complete specification compliance. The manuscript reports human alignment only “under the Usable-rate criterion” on the reviewed subset and supplies no separate agreement metric or confusion matrix for the EXCELLENT category. This leaves the headline gap interpretation dependent on an untested extrapolation from the usable-rate validation.
[Evaluation protocol] Evaluation protocol section (implied by the three-way labeling description): no details are provided on how the runtime evaluator distinguishes EXCELLENT from USABLE when both pass basic playability, nor on inter-rater reliability or edge-case handling for the EXCELLENT label. Because the 20.2% figure drives the main interpretive claim, the absence of targeted validation for this label is load-bearing.
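
The per-label validation requested in these comments could be computed roughly as sketched below: a three-way confusion matrix, a chance-corrected agreement score (Cohen's kappa, chosen here as one common option and not something the manuscript is stated to use), and precision and recall for the EXCELLENT label specifically. All function names and the toy label lists are hypothetical; they are not data from the paper.

```python
from collections import Counter

LABELS = ("EXCELLENT", "USABLE", "UNUSABLE")


def confusion_matrix(runtime, human):
    """Count (runtime_label, human_label) pairs over the reviewed subset."""
    if len(runtime) != len(human):
        raise ValueError("label lists must have equal length")
    pairs = Counter(zip(runtime, human))
    return {r: {h: pairs[(r, h)] for h in LABELS} for r in LABELS}


def cohen_kappa(runtime, human):
    """Chance-corrected agreement between two label sequences."""
    n = len(runtime)
    observed = sum(r == h for r, h in zip(runtime, human)) / n
    r_counts, h_counts = Counter(runtime), Counter(human)
    expected = sum(r_counts[l] * h_counts[l] for l in LABELS) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (observed - expected) / (1 - expected)


def excellent_agreement(runtime, human):
    """Precision and recall of the runtime EXCELLENT label against human review."""
    tp = sum(r == h == "EXCELLENT" for r, h in zip(runtime, human))
    predicted = runtime.count("EXCELLENT")
    actual = human.count("EXCELLENT")
    precision = tp / predicted if predicted else None
    recall = tp / actual if actual else None
    return precision, recall


# Toy labels for illustration only; these are not results from the paper.
runtime = ["EXCELLENT", "EXCELLENT", "USABLE", "USABLE",
           "USABLE", "UNUSABLE", "UNUSABLE", "UNUSABLE"]
human = ["EXCELLENT", "USABLE", "USABLE", "USABLE",
         "EXCELLENT", "UNUSABLE", "UNUSABLE", "USABLE"]

print(confusion_matrix(runtime, human))
print(cohen_kappa(runtime, human))
print(excellent_agreement(runtime, human))
```

Reporting the EXCELLENT row and column of the matrix separately would show whether disagreements concentrate at the EXCELLENT/USABLE boundary, which is the boundary the 20.2% figure depends on.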

minor comments (2)

[Abstract] The abstract states alignment “under the Usable-rate criterion” but does not define the exact subset size or selection criteria; adding these numbers would improve reproducibility.
[Results tables] Table or figure reporting the 76.9% / 20.2% figures should include per-agent and per-configuration breakdowns so readers can assess whether the gap is consistent or driven by outliers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional validation and protocol details would strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract / human-alignment description] Abstract and human-alignment paragraph: the central claim that the 20.2% excellent rate demonstrates incomplete requirement satisfaction rests on the EXCELLENT label being a valid proxy for complete specification compliance. The manuscript reports human alignment only “under the Usable-rate criterion” on the reviewed subset and supplies no separate agreement metric or confusion matrix for the EXCELLENT category. This leaves the headline gap interpretation dependent on an untested extrapolation from the usable-rate validation.

Authors: We agree that the reported human alignment is limited to the Usable-rate criterion and that a dedicated metric for the EXCELLENT category would provide stronger grounding for the interpretive claim. The current validation demonstrates broad consistency between runtime labels and human review for playability, but we acknowledge the extrapolation to full requirement satisfaction. In the revised manuscript we will add a separate agreement metric and confusion matrix specifically for the EXCELLENT label on the reviewed subset. revision: yes
Referee: [Evaluation protocol] Evaluation protocol section (implied by the three-way labeling description): no details are provided on how the runtime evaluator distinguishes EXCELLENT from USABLE when both pass basic playability, nor on inter-rater reliability or edge-case handling for the EXCELLENT label. Because the 20.2% figure drives the main interpretive claim, the absence of targeted validation for this label is load-bearing.

Authors: The runtime evaluator is an automated browser-based system that applies deterministic behavioral criteria once basic playability is established. We will expand the Evaluation Protocol section to document the precise decision rules separating EXCELLENT from USABLE, including explicit handling of edge cases such as incomplete visual feedback or non-critical rule deviations. Because the primary evaluator is automated rather than human, traditional inter-rater reliability statistics do not apply; the existing human-alignment study on the reviewed subset serves as the external validation. We will clarify this distinction in the revision. revision: yes
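
To make the idea of deterministic decision rules concrete, the sketch below records each acceptance check with the fields the appendix fragments mention (pre-trigger state, trigger action, visible result, and a passed/failed/unverified status) and maps the records to a three-way label. The mapping in `assign_label` is purely hypothetical: the actual rules separating EXCELLENT from USABLE are not given in this text, and the class names, field names, and sample checks are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNVERIFIED = "unverified"


@dataclass(frozen=True)
class CheckRecord:
    requirement: str     # acceptance item derived from the specification
    core: bool           # True if the item belongs to the main playable loop
    pre_state: str       # state observed before the trigger
    trigger: str         # user-level action performed in the browser
    visible_result: str  # what a user could see afterwards
    status: Status


def assign_label(records):
    """Hypothetical rule (not the paper's): map check records to a label."""
    if not records:
        raise ValueError("at least one check record is required")
    core = [r for r in records if r.core]
    # Any core check that is not a confirmed pass blocks playable delivery.
    if not core or any(r.status is not Status.PASSED for r in core):
        return "UNUSABLE"
    # Full requirement satisfaction needs every item to be a confirmed pass.
    if all(r.status is Status.PASSED for r in records):
        return "EXCELLENT"
    return "USABLE"


# Invented sample checks for illustration only.
records = [
    CheckRecord("Pressing the left arrow moves the player left", True,
                "player at x=5", "press ArrowLeft", "player at x=4",
                Status.PASSED),
    CheckRecord("Restart resets the score", False,
                "score=30 after game over", "click Restart",
                "score still 30", Status.FAILED),
]
print(assign_label(records))
```

Treating an unverified core check as blocking is a conservative choice in this sketch; a real evaluator might instead route such items to human review.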

Circularity Check

0 steps flagged

No circularity; benchmark reports empirical results with external human validation on subset

The manuscript introduces WebGameBench as an empirical benchmark evaluating coding agents on browser game delivery tasks. It reports aggregate rates (76.9% usable, 20.2% excellent) across 111 tasks and multiple agents/configurations, plus a statement that runtime labels align with human review under the Usable-rate criterion on a reviewed subset. No equations, fitted parameters, predictions derived from inputs, self-definitional constructs, or load-bearing self-citations appear anywhere in the text. The central gap claim is an empirical observation, not a derivation that reduces to its own inputs by construction. Human alignment is presented as independent corroboration rather than an internal tautology. This is a standard benchmark paper whose results stand or fall on external reproducibility, not circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5826 in / 1001 out tokens · 15549 ms · 2026-05-25T05:43:23.147057+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Pressing the left arrow moves the player left

Tencent. Tencent unveils hy3 preview; model enhances agent capabilities and real-world usability. https://www.tencent.com/en-us/articles/2202320.html, 2026. Accessed: 2026-05-07. A Specification Construction Details This appendix records construction details that are intentionally compressed in the main text. The main text defines the Structured WebGame S...

work page arXiv 2026
[50]

open the deployed URL with Playwright and handle browser-access issues when needed

work page
[51]

perform a short smoke interaction to check loadability, entry, and the main playable state

work page
[52]

derive checks from the frozen specification and generic playable-loop requirements

work page
[53]

verify checks through user-level actions before relying on source code or logs

work page
[54]

record the pre-trigger state, trigger action, visible result, and relevant state or numeric deltas

work page
[55]

Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works

mark each acceptance item as passed, failed, or unverified. Source code may explain a failure, but it cannot by itself prove that a user-visible behavior works. D.3 Candidate Preconditions For long-horizon, rare, or randomized states, code or runtime-state manipulation may be used to construct a candidate precondition. This operation does not itself verif...

1. final conclusion: row or sample id, deployed URL, source reference, quality label, and one-sentence summary
2. core functional checks: pass, fail, or unverified for the main playable loop
3. other issues: non-core failures and whether they affect the final label
4. acceptance results: per-requirement observations and evidence
5. unverified items: reason, attempted interactions, and whether review is needed.

E Compute Resources and Trace Statistics

All generation and evaluation jobs are executed through API calls on a CPU cluster. We do not train models or perform GPU fine-tuning. Agent generation uses concurrency 6, and runtime evaluation uses concurrency 20; each concurrent job ...
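The five-part evaluation report listed before the compute section can be sketched as a typed record. This is a hedged illustration only: the class and field names (`FinalConclusion`, `EvaluationReport`, `review_needed`, and so on) are hypothetical and chosen to mirror the listed items, and the three quality labels are the ones the benchmark reports (EXCELLENT, USABLE, UNUSABLE). The source does not specify a storage format.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

QUALITY_LABELS = {"EXCELLENT", "USABLE", "UNUSABLE"}


@dataclass
class FinalConclusion:
    sample_id: str          # row or sample id
    deployed_url: str
    source_reference: str
    quality_label: str
    summary: str            # one-sentence summary

    def __post_init__(self) -> None:
        if self.quality_label not in QUALITY_LABELS:
            raise ValueError(f"unknown quality label: {self.quality_label}")


@dataclass
class EvaluationReport:
    final_conclusion: FinalConclusion
    # check name -> "pass", "fail", or "unverified" for the main playable loop
    core_functional_checks: Dict[str, str] = field(default_factory=dict)
    # non-core failures and whether they affect the final label
    other_issues: List[Dict[str, object]] = field(default_factory=list)
    # per-requirement observations and evidence
    acceptance_results: List[Dict[str, str]] = field(default_factory=list)
    # reason, attempted interactions, and whether review is needed
    unverified_items: List[Dict[str, object]] = field(default_factory=list)

    def needs_review(self) -> bool:
        """True if any unverified item is flagged for review."""
        return any(item.get("review_needed") for item in self.unverified_items)


if __name__ == "__main__":
    report = EvaluationReport(
        final_conclusion=FinalConclusion(
            sample_id="sample-001",                 # hypothetical id
            deployed_url="http://localhost:8000/",  # hypothetical URL
            source_reference="artifact/index.html", # hypothetical path
            quality_label="USABLE",
            summary="Main loop is playable; one requirement is unverified.",
        ),
        core_functional_checks={"start_game": "pass", "move_player": "pass"},
    )
    print(list(asdict(report).keys()))
```

Validating the label at construction time keeps malformed reports from reaching any later aggregation of usable or excellent rates.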


[1]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavari...

[2]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, 2025.

[3]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, 2024.

[4]

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.

[5]

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.

[6]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information P...

[7]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[8]

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021.

[9]

Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems, 2024.

[10]

Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and Tushar Khot. Super: Evaluating agents on setting up and executing tasks from research repositories. In Proceedings of EMNLP, 2024.

[11]

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. Paperbench: Evaluating ai’s ability to replicate ai research. arXiv preprint arXiv:2504.01848, 2025.

[12]

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.

[13]

Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch. arXiv preprint arXiv:2505.03733, 2025.

[14]

Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, and Tao Xie. Webcoderbench: Benchmarking web application generation with comprehensive and interpretable evaluation metrics. arXiv preprint arXiv:2601.02430, 2026.

[15]

Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, and Yang Deng. E2edev: Benchmarking large language models in end-to-end software development task. arXiv preprint arXiv:2510.14509, 2025.

[16]

Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of NAACL, 2025.

[17]

Ryan Li, Yanzhe Zhang, and Diyi Yang. Sketch2code: Evaluating vision-language models for interactive web design prototyping. In Proceedings of NAACL, 2025.

[18]

Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zixin Wang, Xinyi Xu, Wenxuan Wang, Zhiyao Xu, Yuhang Wang, and Michael R. Lyu. Interaction2code: Benchmarking mllm-based interactive webpage code generation from interactive prototyping. In Proceedings of ASE, 2025.

[19]

Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, and Xuelong Li. Webuibench: A comprehensive benchmark for evaluating multimodal large language models in webui-to-code. In Findings of ACL, 2025.

[20]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

[21]

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of ACL, 2024.

[22]

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of ACL, 2024.

[23]

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024.

[24]

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...

[25]

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024.

[26]

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Ruo Yu Tao, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. Textworld: A learning environment for text-based games. arXiv preprint arXiv:1806.11532, 2018.

[27]

Matthew J. Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. In AAAI Conference on Artificial Intelligence, URL https://api.semanticscholar.org/CorpusID:202565447

[29]

Diego Pérez-Liébana, Jialin Liu, Ahmed Khalifa, Raluca D. Gaina, Julian Togelius, and Simon M. Lucas. General video game ai: A multitrack framework for evaluating agents, games, and content generation algorithms. IEEE Transactions on Games, 2019.

[30]

William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The minerl 2019 competition on sample efficient reinforcement learning using human priors. In Advances in Neural Information Processing Systems, 2019.

[31]

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents. arXiv preprint arXiv:2406.06613, 2024.

[32]

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Balrog: Benchmarking agentic llm and vlm reasoning on games. In International Conference on Learning Representations, 2025.

[33]

Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games? arXiv preprint arXiv:2505.15146, 2025.

[34]

Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, and Marc-Alexandre Côté. Tales: Text adventure learning environment suite. arXiv preprint arXiv:2504.14128, 2025.

[35]

Anthropic. Claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7,

[36] [37]

Anthropic. Claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6,

[37] [39]

Anthropic. Claude sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5,

[38] [41]

OpenAI. Introducing gpt-5.5. https://openai.com/index/introducing-gpt-5-5/, Accessed: 2026-05-07

[40] [43]

Google Gemini Team. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026. Accessed: 2026-05-07.

[41] [44]

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026.

[42] [45]

Moonshot AI. Kimi k2.6: Scaling agentic intelligence. https://www.kimi.com/blog/kimi-k2-6, 2026. Accessed: 2026-05-07.
