pith. machine review for the scientific record.

arxiv: 2604.07429 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI · cs.HC

Recognition: no theorem link

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC
keywords multimodal agents · game agents · benchmark · video games · evaluation framework · MLLM · computer-use agents · semantic action parsing

The pith

GameWorld benchmark shows even top multimodal AI agents fall far short of human performance on 34 video games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GameWorld to create a standardized, reproducible way to test multimodal large language models as game agents inside browser environments. It defines two clear interfaces for agents—one that emits raw keyboard and mouse actions and another that uses deterministic semantic action parsing—and supplies 34 games containing 170 tasks, each equipped with objective state-verifiable success metrics. Evaluation of 18 model-interface pairs finds that the strongest current agents remain well below human levels in perception, long-horizon planning, and precise control. Video games serve as a closed-loop testbed that captures the latency, sparse feedback, and irreversible mistakes that embodied agents must handle in real settings. The work therefore supplies both a measurement tool and evidence of the distance still to be covered.

Core claim

GameWorld supplies a benchmark of 34 diverse games and 170 tasks inside browser environments, together with two standardized agent interfaces: direct keyboard-and-mouse control and semantic action parsing. Repeated evaluation of 18 model-interface combinations demonstrates that current multimodal agents remain far from human capabilities, and it exposes specific challenges in real-time interaction, memory use, and action validity.

What carries the argument

GameWorld benchmark: 34 games paired with 170 state-verifiable tasks, supporting two agent interfaces (direct control and semantic action parsing) inside a browser environment for outcome-based scoring.
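To make the contrast between the two interfaces concrete, here is a minimal sketch of how they might differ in code. All names (`ComputerUseAgent`, `parse_semantic_action`, `ACTION_TABLE`) are hypothetical illustrations, not GameWorld's actual API; the paper specifies only that one interface emits raw keyboard and mouse events while the other maps model text into a semantic action space via deterministic parsing.

```python
# Hypothetical sketch of the two agent interfaces; names and signatures
# are illustrative assumptions, not GameWorld's real API.
import re
from dataclasses import dataclass

@dataclass
class RawEvent:
    kind: str      # e.g. "key_down", "key_up", "mouse_move", "mouse_click"
    payload: dict  # e.g. {"key": "ArrowLeft"} or {"x": 320, "y": 240}

class ComputerUseAgent:
    """Interface (i): the model emits raw keyboard/mouse events directly."""
    def act(self, screenshot: bytes) -> list[RawEvent]:
        # In practice this would be an MLLM call returning low-level controls.
        return [RawEvent("key_down", {"key": "ArrowLeft"})]

# Fixed semantic action vocabulary for one game (assumed example).
ACTION_TABLE = {
    "move_left":  [RawEvent("key_down", {"key": "ArrowLeft"})],
    "move_right": [RawEvent("key_down", {"key": "ArrowRight"})],
    "jump":       [RawEvent("key_down", {"key": "Space"})],
}

def parse_semantic_action(model_text: str) -> list[RawEvent]:
    """Interface (ii): deterministic Semantic Action Parsing.
    Free text is mapped to a fixed vocabulary with no learned or sampled
    step, so the same text always yields the same control events."""
    match = re.search(r"action:\s*(\w+)", model_text.lower())
    if match and match.group(1) in ACTION_TABLE:
        return ACTION_TABLE[match.group(1)]
    return []  # invalid action: nothing executes, and the miss can be logged

print(parse_semantic_action("Action: move_left"))  # -> one key_down event
```

The determinism of the parser is what makes action validity measurable: a model output that fails to match the vocabulary can be counted as an invalid action rather than silently coerced into a legal one.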

If this is right

  • Agents must improve handling of latency, sparse rewards, and irreversible errors within closed interaction loops.
  • Semantic action parsing offers a cleaner interface than raw controls for generalist multimodal models.
  • Repeated full-benchmark reruns provide a stable baseline for tracking future progress (a minimal scoring sketch follows this list).
  • Targeted studies on context memory and real-time constraints identify concrete bottlenecks for agent design.
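A hedged sketch of what outcome-based scoring with repeated full-benchmark reruns could look like. The task objects, verifier predicates, and rerun count here are assumptions for illustration; the paper states only that each of the 170 tasks carries a state-verifiable success metric and that the full benchmark was rerun repeatedly to check robustness.

```python
# Illustrative sketch: outcome-based evaluation with repeated reruns.
# All names (run_episode, verify_success, tasks) are hypothetical.
from statistics import mean, stdev

def evaluate_benchmark(agent, tasks, n_reruns: int = 3) -> tuple[float, float]:
    """Return mean and std of the overall success rate across full reruns.

    Each task supplies a deterministic verifier over the final game state,
    so success is a binary, externally checkable outcome rather than a
    heuristic judgment of the agent's trajectory."""
    rates = []
    for _ in range(n_reruns):
        successes = 0
        for task in tasks:
            final_state = task.run_episode(agent)   # closed-loop rollout
            if task.verify_success(final_state):    # state-verifiable metric
                successes += 1
        rates.append(successes / len(tasks))
    return mean(rates), (stdev(rates) if len(rates) > 1 else 0.0)
```

Reporting spread across reruns is what supports the robustness claim: a stable mean with small variance indicates the gap to humans is a property of the agents, not run-to-run noise.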

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark's verifiable metrics could support automated training loops that let agents improve through repeated self-play.
  • If the games capture core requirements of embodied interaction, similar evaluation pipelines might transfer to robotic or simulation-based tasks.
  • Persistent gaps suggest that simply scaling current models will not close the distance without new mechanisms for long-horizon planning and fine motor control.

Load-bearing premise

The chosen 34 games, 170 tasks, browser setting, and semantic action parsing together constitute a representative test of general multimodal agent abilities without large interface biases or gaps in real-world interaction challenges.

What would settle it

A follow-up run in which the highest-scoring agent reaches human-comparable success rates on at least 80 percent of the 170 tasks (136 tasks) across multiple full-benchmark evaluations would indicate the performance gap is closing; consistent sub-human results despite new models would indicate the gap persists.

read the original abstract

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GameWorld, a benchmark for standardized evaluation of multimodal LLM agents as game players in browser environments. It covers 34 games and 170 tasks with state-verifiable outcome metrics, studies two interfaces (direct computer-use via keyboard/mouse emission and semantic actions via deterministic parsing), evaluates 18 model-interface pairs, and reports that even the strongest agents remain far below human performance levels. Additional experiments address benchmark robustness via repeated reruns, real-time interaction, context-memory sensitivity, and action validity.

Significance. If the human comparisons and interface controls are fairly matched, GameWorld supplies a reproducible, verifiable testbed that directly targets the perception-planning-control loop required for embodied agents. The use of closed-loop browser environments, deterministic parsing, and outcome-based verification addresses common heterogeneity problems in game-agent evaluation and provides a concrete platform for measuring progress toward generalist multimodal agents.

major comments (2)
  1. [Results] Results section (performance comparison to humans): The headline claim that the best of the 18 model-interface pairs is 'far from achieving human capabilities' depends on human baselines collected under identical constraints (browser rendering, action latency, parsed vs. native controls). The manuscript provides no indication that humans were scored inside the same browser environment; if native desktop play was used instead, the reported gap conflates agent limitations with interface friction and cannot be attributed solely to perception/planning/control deficits.
  2. [Benchmark Construction] Benchmark description (§3 or equivalent): Task selection criteria for the 170 tasks across 34 games are not fully specified (e.g., coverage of game genres, difficulty calibration, or avoidance of interface-specific biases). Without these details the claim that the benchmark forms a representative test of general multimodal agent capabilities remains difficult to evaluate.
minor comments (2)
  1. [Abstract] Abstract and methods: Exact human baseline collection protocol, number of human trials, and any error bars or variance measures are not reported, even though the abstract cites 'performance gaps and robustness from repeated reruns.'
  2. [Introduction] The project page is referenced but the paper should explicitly state which evaluation details (full task lists, human protocols, raw logs) are only available online versus contained in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of GameWorld's contributions. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Results] Results section (performance comparison to humans): The headline claim that the best of the 18 model-interface pairs is 'far from achieving human capabilities' depends on human baselines collected under identical constraints (browser rendering, action latency, parsed vs. native controls). The manuscript provides no indication that humans were scored inside the same browser environment; if native desktop play was used instead, the reported gap conflates agent limitations with interface friction and cannot be attributed solely to perception/planning/control deficits.

    Authors: We appreciate this critical observation on ensuring fair human baselines. All human performance data were in fact collected inside the identical browser environment using the same two interfaces (direct keyboard/mouse emission and semantic action parsing) with matched rendering, latency, and control constraints. We regret that this protocol was not stated explicitly in the original Results section. We have revised the manuscript to add a dedicated subsection describing the human evaluation procedure, including participant instructions, interface screenshots, and explicit confirmation that no native desktop controls were used. This change directly resolves the concern and reinforces that the reported performance gap reflects agent limitations in perception, planning, and control. revision: yes

  2. Referee: [Benchmark Construction] Benchmark description (§3 or equivalent): Task selection criteria for the 170 tasks across 34 games are not fully specified (e.g., coverage of game genres, difficulty calibration, or avoidance of interface-specific biases). Without these details the claim that the benchmark forms a representative test of general multimodal agent capabilities remains difficult to evaluate.

    Authors: We agree that greater transparency on task selection strengthens the benchmark's claims. We have expanded Section 3 with a new subsection titled 'Task Selection Criteria and Benchmark Design.' It now details: genre coverage across 8 categories (action, puzzle, strategy, simulation, etc.) with explicit game examples; difficulty calibration via pilot human playtests and rule-based agent runs to span easy-to-hard tasks with associated completion-time statistics; and bias mitigation by verifying every task is solvable under both interfaces through deterministic simulations and excluding any game where one interface confers an inherent advantage. These additions make the representativeness of the 170 tasks explicit and support the evaluation of generalist multimodal capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent human comparisons

full rationale

The paper introduces GameWorld as a new benchmark with 34 games, 170 tasks, and two agent interfaces (computer-use and semantic-action), then reports direct empirical results across 18 model-interface pairs. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are present in the abstract or described methods. The central claim that agents are far from human capabilities rests on outcome-based metrics and repeated reruns for robustness, not on any reduction to inputs by construction. Human baselines are positioned as external reference points, with the benchmark itself offered as an independent, reproducible evaluation tool via the project page.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical benchmark paper, the central claim rests on the design choices for games, tasks, interfaces, and metrics rather than mathematical axioms or new physical entities; no free parameters or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5573 in / 1094 out tokens · 36437 ms · 2026-05-10T17:54:29.903842+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

Reference graph

Works this paper leans on

87 extracted references · 17 canonical work pages · cited by 1 Pith paper · 9 internal anchors
