
arxiv: 2606.09826 · v1 · pith:HZASDSOLnew · submitted 2026-06-08 · 💻 cs.CV · cs.AI

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Pith reviewed 2026-06-27 16:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords VLM game agents · Unreal Engine 5 benchmark · improvement dynamics curve · agent reflection · multi-agent games · vision-language models · unified action interfaces

The pith

OmniGameArena supplies twelve new UE5 games and an IDC harness so VLM agents can be scored on cold-start performance plus how their skills evolve through autonomous reflection rounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing game benchmarks for vision-language model agents typically record only a single first-attempt score, concentrate on solo play, and offer no common protocol for comparing commercial VLMs, open-weight VLMs, and specialized policies. The paper introduces OmniGameArena, a real-time collection of twelve newly constructed Unreal Engine 5 games that span seven solo, three player-versus-player, and two cooperative scenarios, all sharing unified action interfaces. It also defines the Improvement Dynamics Curve, an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across successive rounds. The resulting observables are the initial score, the trajectory of scores across rounds, and performance of the refined skill on held-out task variants. These measurements are reported for twelve VLM agents on the cold-start leaderboard and for four top agents under the IDC process.

Core claim

OmniGameArena consists of twelve newly built Unreal Engine 5 games covering solo, PvP, and coop play with unified action interfaces, paired with the Improvement Dynamics Curve harness in which a tool-using reflector LLM autonomously refines bounded skill prompts across multiple rounds, thereby exposing for each agent-game pair both the evolution of scores across reflection rounds and the behavior of the learned skill on held-out task variants.

What carries the argument

The Improvement Dynamics Curve (IDC), an agentic-reflection harness that uses a tool-using reflector LLM to autonomously refine a bounded skill prompt across successive rounds.
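
The reflection loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' harness: `run_idc`, `IDCResult`, the `play`, `reflect`, and `play_held_out` callables, and the `MAX_SKILL_CHARS` bound are all hypothetical names, and truncating by character count is only one assumed way of keeping the skill prompt bounded.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical bound on the skill prompt; the source only says it is "bounded".
MAX_SKILL_CHARS = 2000


@dataclass
class IDCResult:
    # round_scores[0] is the cold-start score; later entries follow each reflection.
    round_scores: List[float] = field(default_factory=list)
    held_out_score: float = 0.0
    final_skill: str = ""


def run_idc(
    play: Callable[[str], float],
    reflect: Callable[[str, float], str],
    play_held_out: Callable[[str], float],
    num_rounds: int,
    initial_skill: str = "",
) -> IDCResult:
    """Run one (agent, game) pair through num_rounds reflection rounds.

    play(skill)           -> score of the agent on the game with this skill prompt
    reflect(skill, score) -> revised skill prompt proposed by the reflector
    play_held_out(skill)  -> score of the final skill on held-out task variants
    """
    skill = initial_skill
    result = IDCResult()
    for round_idx in range(num_rounds + 1):
        score = play(skill)
        result.round_scores.append(score)
        if round_idx < num_rounds:
            # Assumed bounding rule: clip the revised prompt to a fixed length.
            skill = reflect(skill, score)[:MAX_SKILL_CHARS]
    result.final_skill = skill
    result.held_out_score = play_held_out(skill)
    return result
```

The three fields of `IDCResult` correspond to the three observables named in the summary: the initial score, the per-round trajectory, and the held-out score of the refined skill.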

If this is right

  • VLM agents of different types become directly comparable on the same set of games and interfaces.
  • Score change across reflection rounds becomes a standard observable in addition to the initial score.
  • Generalization is measured by testing the refined skill on held-out task variants.
  • Evaluation extends naturally to PvP and cooperative multi-agent settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents might be trained explicitly to maximize improvement rate rather than single-shot peak performance.
  • The IDC approach could be applied to other interactive domains such as robotic control or simulation environments.
  • Observed dynamics may vary with the choice of reflector LLM, suggesting separate study of that component.

Load-bearing premise

The twelve newly constructed UE5 games together with their unified action interfaces and the reflector LLM inside the IDC form a representative testbed that fairly compares heterogeneous agent classes without introducing design artifacts or reflection biases.

What would settle it

Finding that replacement of the reflector LLM or substitution of the custom games with existing commercial titles produces substantially different improvement trajectories or leaderboard orderings would show the benchmark does not deliver stable, unbiased observables.
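
One way to quantify whether leaderboard orderings are "substantially different" is a rank correlation between two leaderboards. The sketch below computes Kendall's tau-a in plain Python; the function name `kendall_tau` and the choice of this statistic are assumptions made for illustration, not something the paper or this review specifies.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two leaderboards keyed by agent name.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Tied pairs count as neither concordant nor discordant.
    """
    agents = sorted(set(scores_a) & set(scores_b))
    if len(agents) < 2:
        raise ValueError("need at least two agents present in both leaderboards")
    concordant = 0
    discordant = 0
    for x, y in combinations(agents, 2):
        # The sign of the product tells whether both leaderboards order x and y the same way.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / total_pairs
```

A value near 1 after swapping the reflector LLM would suggest the ordering is stable; what counts as a meaningful drop is a judgment call that this sketch does not settle.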

Figures

Figures reproduced from arXiv: 2606.09826 by Fan Zhang, Lingting Zhu, Mingxian Lin, Shengju Qian, Wei Huang, Xiaojuan Qi, Xin Wang, Yi-Hua Huang, Yitang Li, Yiyu Wang, Yuqi Liu, Zeyu Hu.

Figure 1: OmniGameArena at a glance. Twelve newly built UE5 games span Solo (7), PvP (3), and Coop (2) regimes
Figure 2: Radar charts of the 12 OmniGameArena games across seven capability dimensions. The abbreviations …
Figure 3: Overview of the Improvement Dynamics Curve (IDC) harness. The …
Figure 4: PvP win rates of Player 1 (row) against Player 2 (column) per game over all pairings.
… therefore evaluate under two clock modes that both pause the environment during inference: Paused Decision Quality (PDQ) freezes the environment for the full inference call and treats decision time as free, isolating pure decision quality; Latency-Controlled Real-Time (LCRT) additionally idles for the server-reported infe…
Figure 5: IDC curves: per-round mean episode score
Figure 6: PvP win rates of Player 1 (row) against Player 2 (column) on MidlineClash under latency control setting.
… of clock mode rather than a change in model ability. Charging decision latency against the game clock leaves far less game time per move, so each player completes only about 18 actions per game under LCRT against the ∼42-action PDQ budget, and matches compress into low-scoring, frequently drawn games. …
Figure 7: Qualitative comparison on Last Stand using GPT-5.5
Figure 8: Qualitative comparison on Shared Floor using Gemini-3.1-Pro.
… tile-survival behavior, while SharedFloor prompts emphasize cooperative division of labor, station alignment, and order-refresh handling. The scope here is limited to LastStand and SharedFloor; ObstacleRun3D is intentionally excluded.
D Visualization
Visualization results are shown in Figures 9–20. For each game, we visualize representative traj…
Figure 9: Visualization results for cue_chase. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 10: Visualization results for last_stand. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 11: Visualization results for monster_shoot. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 12: Visualization results for obstacle_run_2d. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 13: Visualization results for obstacle_run_3d. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 14: Visualization results for scene_escape. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 15: Visualization results for solo_craft. Each row shows one model, with five sampled frames from the corresponding trajectory.
Figure 16: Visualization results for handoff_run. Each row shows one cooperative model pair, with five sampled frames from the corresponding episode.
Figure 17: Visualization results for shared_floor. Each row shows one cooperative model pair, with five sampled frames from the corresponding episode.
Panel labels: GPT-5.5 vs Claude Opus 4.6; Kimi-K2.5 vs Claude Opus 4.6
Figure 18: Visualization results for crystal_guard. Each row shows one representative PvP matchup, with five sampled frames from the corresponding match.
Figure 19: Visualization results for midline_clash. Each row shows one representative PvP matchup, with five sampled frames from the corresponding match.
Panel labels: Claude Opus 4.6 vs Kimi-K2.5; Claude Opus 4.6 vs GPT-5.5
Figure 20: Visualization results for sky_duel. Each row shows one representative PvP matchup, with five sampled frames from the corresponding match.
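
The text captured alongside Figures 4 and 6 describes two clock modes: PDQ treats decision time as free, while LCRT charges decision latency against the game clock. The sketch below illustrates why that shrinks the per-game action budget. It assumes LCRT simply adds the reported inference latency to each action's game-time cost; the names `ClockMode` and `actions_per_game` and every number in the usage note are hypothetical and are not taken from the paper.

```python
from enum import Enum


class ClockMode(Enum):
    PDQ = "paused_decision_quality"
    LCRT = "latency_controlled_real_time"


def game_time_per_action(mode: ClockMode, step_seconds: float, inference_seconds: float) -> float:
    """Game-clock seconds consumed by one decision plus its action."""
    if mode is ClockMode.PDQ:
        # Environment is frozen during inference, so only the action itself costs game time.
        return step_seconds
    # Assumed LCRT rule: the inference latency is also charged to the game clock.
    return step_seconds + inference_seconds


def actions_per_game(
    mode: ClockMode, game_seconds: float, step_seconds: float, inference_seconds: float
) -> int:
    """Number of whole actions that fit into one game under the given clock mode."""
    cost = game_time_per_action(mode, step_seconds, inference_seconds)
    if cost <= 0:
        raise ValueError("per-action game time must be positive")
    return int(game_seconds // cost)
```

With made-up values of a 60-second game, 1.4 seconds of game time per action, and 1.9 seconds of inference latency, `actions_per_game` gives 42 under `ClockMode.PDQ` and 18 under `ClockMode.LCRT`. These inputs were chosen only to land near the roughly 42 versus 18 actions quoted in the Figure 6 text; they are not measured quantities.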
Original abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OmniGameArena, a real-time benchmark consisting of twelve newly constructed Unreal Engine 5 games (7 Solo, 3 PvP, 2 Coop) equipped with unified action interfaces, together with the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt over multiple rounds. Beyond cold-start leaderboard scores for twelve VLM agents, the work reports two additional observables per (agent, game) pair: score evolution across reflection rounds and behavior of the learned skill on held-out task variants, with IDC results shown for the four top agents.

Significance. If the new environments and IDC harness prove free of systematic bias, the benchmark could supply a needed unified protocol for comparing commercial VLMs, open-weight VLMs, and specialized policies across solo and multi-agent settings, while the IDC observables would add longitudinal and generalization information absent from single-shot game benchmarks. The explicit construction of real-time UE5 titles and the agentic-reflection mechanism are concrete contributions that could be adopted by the VLM-agent community.

major comments (2)
  1. [Benchmark Construction and Game Descriptions] The central claim that the twelve author-built UE5 games plus unified action interfaces constitute a neutral, representative testbed is unsupported by any validation, cross-benchmark calibration, or analysis of potential design artifacts (game mechanics, visual cues, reward structures, or action mappings). This is load-bearing for the fairness of both the cold-start leaderboard and the IDC curves.
  2. [IDC Harness and Experimental Protocol] No ablation or control experiment isolates whether observed IDC score evolution arises from intrinsic VLM agent improvement or from the reflector LLM's prompt-refinement policy. Without such separation, the reported improvement dynamics cannot be unambiguously attributed to the evaluated agents.
minor comments (2)
  1. [Results] The abstract states that observables are reported for twelve agents on the cold-start leaderboard and four under IDC, yet the manuscript should include the precise numerical values, variance estimates, and statistical tests in a dedicated results table or figure for reproducibility.
  2. [Related Work] References to prior VLM game benchmarks (e.g., those using established titles) are needed to situate the novelty of the twelve new UE5 environments and to allow readers to assess the claimed unification of protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, proposing targeted revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Benchmark Construction and Game Descriptions] The central claim that the twelve author-built UE5 games plus unified action interfaces constitute a neutral, representative testbed is unsupported by any validation, cross-benchmark calibration, or analysis of potential design artifacts (game mechanics, visual cues, reward structures, or action mappings). This is load-bearing for the fairness of both the cold-start leaderboard and the IDC curves.

    Authors: We agree the manuscript provides no cross-benchmark calibration or systematic artifact analysis. The twelve games were newly constructed in UE5 specifically to span Solo, PvP, and Coop settings under a single action interface, enabling direct comparison of commercial VLMs, open-weight VLMs, and specialized policies. In revision we will add an appendix with per-game descriptions of mechanics, visual cues, reward structures, and action mappings, plus a limitations paragraph discussing possible design biases. Full external calibration remains outside the current scope.
    revision: partial

  2. Referee: [IDC Harness and Experimental Protocol] No ablation or control experiment isolates whether observed IDC score evolution arises from intrinsic VLM agent improvement or from the reflector LLM's prompt-refinement policy. Without such separation, the reported improvement dynamics cannot be unambiguously attributed to the evaluated agents.

    Authors: We concur that the present protocol lacks an explicit control isolating the reflector LLM's contribution. The IDC design uses a fixed reflector policy across agents to measure how each VLM agent's performance changes when given progressively refined skill prompts. In the revised version we will include a control condition in which the reflector applies a non-adaptive (fixed or random) prompt policy and report the resulting score trajectories for the four top agents, allowing direct comparison to the adaptive-reflector results.
    revision: yes
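
The comparison proposed in the second response could be summarized along these lines. This is a hedged sketch, not the authors' analysis: `reflection_attribution` and its output keys are hypothetical, and subtracting the control trajectory from the adaptive one is just one simple way to read the two curves side by side.

```python
from typing import Dict, List, Sequence, Union


def improvement(trajectory: Sequence[float]) -> float:
    """Final-round score minus cold-start score."""
    if not trajectory:
        raise ValueError("trajectory must contain at least the cold-start score")
    return trajectory[-1] - trajectory[0]


def reflection_attribution(
    adaptive: Sequence[float], control: Sequence[float]
) -> Dict[str, Union[float, List[float]]]:
    """Compare an adaptive-reflector trajectory with a non-adaptive control.

    Both sequences hold one score per round for the same (agent, game) pair,
    starting with the cold-start score.
    """
    if len(adaptive) != len(control):
        raise ValueError("trajectories must cover the same number of rounds")
    adaptive_gain = improvement(adaptive)
    control_gain = improvement(control)
    return {
        "adaptive_gain": adaptive_gain,
        "control_gain": control_gain,
        # Gain beyond what the non-adaptive prompt policy already achieves.
        "excess_gain": adaptive_gain - control_gain,
        "per_round_gap": [a - c for a, c in zip(adaptive, control)],
    }
```

An `excess_gain` close to zero would indicate that the adaptive reflector adds little beyond the control policy for that pair; any threshold for "close" would still need a variance estimate across episodes.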

Circularity Check

0 steps flagged

No circularity; benchmark definition is self-contained

full rationale

The manuscript defines a new benchmark (OmniGameArena) consisting of author-built UE5 games and a new evaluation harness (IDC) without any equations, fitted parameters, predictions, or derivations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The observables (cold-start scores, score evolution, held-out behavior) are direct measurements on the defined testbed rather than quantities derived from prior results by construction. This matches the expected non-finding for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the contribution is described as new benchmark construction without detailed technical dependencies.

pith-pipeline@v0.9.1-grok · 5753 in / 1109 out tokens · 25863 ms · 2026-06-27T16:48:21.451851+00:00 · methodology

