
arxiv: 2606.18950 · v2 · pith:ZVY5V3WFnew · submitted 2026-06-17 · 💻 cs.AI

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

Pith reviewed 2026-06-26 20:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords vision-language models · real-time strategy games · strategic reasoning · multi-agent coordination · benchmark · self-evolving generation · partial observability

The pith

State-of-the-art vision-language models perform poorly on RTS tasks requiring tighter coordination and larger scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces RTSGameBench, built on the large-scale Beyond All Reason game, to diagnose limitations in how vision-language models handle strategic reasoning under partial observability and uncertainty. The benchmark evaluates models across diverse matchups, uses mini-games that each isolate one strategic competency, and applies a self-evolving generation framework that turns free-form queries into new mini-games to expand coverage over cycles. It also supplies RTSGameAgent, which uses a finite state machine plus agentic memory to let VLMs control units in these environments. The empirical results show that current VLMs degrade when matchups need tighter or multi-agent coordination and when task scale grows. A sympathetic reader would care because these findings point to gaps in model capabilities for any setting that mixes long-horizon planning, opponent adaptation, and teamwork.

Core claim

RTSGameBench provides three things: evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini-games that each target an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. RTSGameAgent manages units with a finite state machine (FSM) and agentic memory. Multiple state-of-the-art VLMs perform poorly when matchups demand tighter or multi-agent coordination and when task scale increases.
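
To make "an FSM with agentic memory" concrete, here is a minimal sketch of a unit-group state machine that logs every transition. The state names, event names, and transition table are assumptions made for illustration; the page does not list the states RTSGameAgent actually uses, and the plain event log only stands in for its memory component.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class GroupState(Enum):
    # Hypothetical states; not taken from the paper.
    IDLE = auto()
    MOVING = auto()
    ENGAGING = auto()
    RETREATING = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (GroupState.IDLE, "move_order"): GroupState.MOVING,
    (GroupState.MOVING, "arrived"): GroupState.IDLE,
    (GroupState.MOVING, "enemy_sighted"): GroupState.ENGAGING,
    (GroupState.ENGAGING, "enemy_destroyed"): GroupState.IDLE,
    (GroupState.ENGAGING, "low_health"): GroupState.RETREATING,
    (GroupState.RETREATING, "arrived"): GroupState.IDLE,
}


@dataclass
class UnitGroup:
    name: str
    state: GroupState = GroupState.IDLE
    # Simple event log standing in for agentic memory.
    memory: list = field(default_factory=list)

    def handle(self, event: str) -> GroupState:
        """Apply one event; unknown (state, event) pairs leave the state unchanged."""
        next_state = TRANSITIONS.get((self.state, event), self.state)
        self.memory.append((self.state.name, event, next_state.name))
        self.state = next_state
        return self.state


group = UnitGroup("alpha")
for event in ["move_order", "enemy_sighted", "low_health", "arrived"]:
    group.handle(event)
print(group.state.name, len(group.memory))
```

In a design like this the model only has to emit high-level events or orders, while the table handles per-group bookkeeping. That split is also why the referee's ablation request matters: a weak transition table would cap performance regardless of which VLM sits on top.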

What carries the argument

RTSGameBench's diagnostic mini-games, each targeting an individual strategic competency, together with its self-evolving generation framework that converts free-form queries into new mini-games.

If this is right

  • VLMs require improved handling of multi-agent coordination under partial observability.
  • Performance drops occur as RTS task scale increases.
  • The benchmark enables systematic isolation and diagnosis of specific strategic weaknesses.
  • Self-evolving scenario generation expands test coverage automatically across cycles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark's structure could be adapted to evaluate VLMs in non-game domains that also require opponent modeling and long-horizon coordination, such as multi-robot task allocation.
  • The identified coordination and scaling failures suggest that hybrid VLM-plus-classical-planner systems may be needed before reliable deployment in uncertain multi-agent settings.
  • Successive cycles of the self-evolving framework could be used to generate progressively harder tests that track whether model improvements close the observed gaps.

Load-bearing premise

The mini-games each target an individual strategic competency and the self-evolving generation framework converts free-form queries into new mini-games that improve coverage over successive cycles.
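
The Figure 3 caption describes the generation framework as four stages (scenario planning, GDD generation, rule set construction, game implementation) with inter-stage gating and rollback. A rough sketch of that control flow follows. The stage identifiers, the `gate` callback, and the "step back one stage on rejection" policy are all assumptions; in the paper a project manager agent decides where to route a rejected artifact.

```python
from typing import Callable, Dict, List

# Stage names follow the Figure 3 caption; the identifiers themselves are made up.
STAGES = [
    "scenario_planning",
    "gdd_generation",
    "rule_set_construction",
    "game_implementation",
]


def run_pipeline(
    query: str,
    workers: Dict[str, Callable[[dict], object]],
    gate: Callable[[str, object], bool],
    max_steps: int = 20,
) -> dict:
    """Run stages in order; a failed gate rolls back one stage (assumed policy)."""
    artifacts: dict = {"query": query}
    history: List[str] = []
    i = 0
    steps = 0
    while i < len(STAGES) and steps < max_steps:
        stage = STAGES[i]
        output = workers[stage](artifacts)
        history.append(stage)
        if gate(stage, output):
            artifacts[stage] = output
            i += 1
        else:
            # Roll back: discard the previous stage's artifact and redo it.
            i = max(i - 1, 0)
            artifacts.pop(STAGES[i], None)
        steps += 1
    return {"artifacts": artifacts, "history": history, "ok": i == len(STAGES)}


# Demo with stub workers; the gate rejects the first rule set only.
attempts = {"rule_set_construction": 0}


def stub(stage: str) -> Callable[[dict], object]:
    return lambda artifacts: f"{stage} output"


def demo_gate(stage: str, output: object) -> bool:
    if stage == "rule_set_construction":
        attempts[stage] += 1
        return attempts[stage] > 1
    return True


result = run_pipeline("a mini-game about scouting", {s: stub(s) for s in STAGES}, demo_gate)
print(result["history"])
```

The `max_steps` cap is there because a gate that never passes would otherwise loop forever; whether the real system bounds retries is not stated on this page.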

What would settle it

Demonstrating that multiple state-of-the-art VLMs achieve high win rates or strong competency scores on the coordination-heavy mini-games and on larger-scale RTS matchups would falsify the reported performance limitations.

Figures

Figures reproduced from arXiv: 2606.18950 by Daechul Ahn, Hyeonbeom Choi, Jonghyun Choi, Reokyoung Kim, San Kim, Seungyeon Jwa.

Figure 1: Overview of RTSGameBench. We evaluate VLMs’ strategic reasoning through three components: (1) Full Game Evaluation across diverse matchup structures; (2) Diagnostic Mini-Games each targeting an individual strategic competency; and (3) a Self-Evolving Game Generation Framework that converts free-form queries into new diagnostic games via multi-agent collaboration, enabling on-demand extensibility. converts …
Figure 2: Diagnostic mini-games. Each scenario targets a core strategic competency (resource management, spatial and temporal reasoning, opponent modeling, collaboration, and adversarial planning), with fog-of-war selectively applied per game (Tab. 2). 3.2 Diagnostic Mini-Games Full game play necessitates the simultaneous application of diverse strategic competencies, which often conflates distinct behavioral traits a…
Figure 3: Self-evolving game generation pipeline. A project manager (PM) orchestrates VLM-based agents through four stages (scenario planning, GDD generation, rule set construction, and game implementation) with inter-stage gating and rollback. A shared knowledge database stores validated GDDs and rule sets, enabling reuse and fast-tracking. PM’s retrospective analysis refines quality rubrics after each generation. (…
Figure 4: Inference loop of RTSGameAgent. At each decision step, the memory phase (left) consolidates short-term event logs S_t with long-term memory L_{t−1} via an LLM, producing relevant entries m_t and updated memory L_t. The decision phase (right) feeds m_t, game knowledge K, and multimodal observations o_t to the VLM policy π, which outputs four action types: building construction, unit production, group assignment, an…
Figure 6: Self-evolution over successive generation batches.
Figure 7: Example of a mini-game generated by the self-evolving framework.
Figure 1: Large-scale combat environments in BAR. Snapshots of large-scale battles in Beyond All Reason. Top: a four-team battle where each team deploys 20 commanders (80 commanders in total). Bottom: a 50vs50 team battle where each side controls 50 commanders, illustrating the extreme scale of combat where thousands of units can engage simultaneously on the battlefield. Map size. We compare the largest available ma…
Figure 2: Overview of the base agent interface. At each decision step t, the agent receives a multimodal observation o_t consisting of visual channels v_t (a global minimap and local camera views) and a structured textual observation W(s_t) extracted by a Python wrapper. Combined with static game knowledge K, the VLM policy π generates an action plan over three action types: building construction, unit production, and …
Figure 3: Initial configurations of the diagnostic mini-games.
Figure 4: Full-game performance across difficulty levels.
Figure 5: Map layouts and starting positions. Blue markers indicate the starting positions of the controlled team (including teammates), while markers in other colors denote enemy starting positions. In the 3v3 setting, the red starting points are placed symmetrically with respect to the blue starting points. Decision phase. The system prompt, input example, and output example used in the decision phase are shown in…
Figure 6: Example of decision execution in a full-game scenario.
Figure 7: Inference interval sensitivity analysis.
Figure 19: Stage 2: game design document (GDD) generation. The prompts and rubric used in the GDD generation stage are shown in Figures 20 to 23. In this stage, the designer first generates the game design document (GDD), the analyst evaluates it using a predefined rubric, and the designer refines the GDD based on the feedback. Stage 3: rule set construction. The prompts and rubric used in the rule set construction …
Figure 8: Representative mini-games generated by the self-evolving game gen
Figure 9: Example of updated rubrics during Self-Evolving.
Figure 10: System prompt for the memory phase. The system prompt used in the memory phase to manage and update the agent’s memory before the decision phase.
Figure 11: Input example for the memory phase. The input to the memory phase consists of the current memory state accumulated so far and newly observed triggers, such as enemy sightings or combat outcomes.
Figure 12: Output example for the memory phase. The output of the memory phase consists of an updated memory state reflecting the newly observed triggers and a set of retrieved memories provided to guide decision-making in the decision phase.
Figure 13: System prompt for the decision phase. Placeholders (denoted as {...}) in the system prompt are replaced with the game description, unit knowledge, and team color specification. The game description provides a textual explanation of each game, the unit knowledge contains brief descriptions of units available in the game, and the team color specification maps each team to its corresponding color in the game …
Figure 14: Input example for the decision phase. The input to the decision phase consists of both textual observations and visual observations of the current game state.
Figure 15: Output example for the decision phase. The output of the decision phase specifies actions across four decision categories: constructing buildings, producing units, assigning units to groups, and issuing movement commands to groups.
Figure 16: System prompt for inter-stage gating. The Project Manager reviews the feedback history and determines which agent to route the request to for the next step.
Figure 17: System prompt for Game Summary. After the game generation process is completed, the Project Manager summarizes the generated game and provides the user with an overview of its key properties.
Figure 18: System prompt for rubric update. After the game generation process is completed, the Project Manager updates the rubric for future game generation by reflecting on the history accumulated during the current game creation process.
Figure 19: System prompt for interaction between the designer and the user.
Figure 20: System prompt for generating GDD. In Stage 2, the Designer agent generates a Game Design Document (GDD) based on the scenario brief produced in the previous stage.
Figure 21: System prompt for validating GDD. The Analyst agent reviews and validates the Game Design Document (GDD) generated by the Designer agent.
Figure 22: Rubrics for GDD. The initial rubric used by the Analyst agent to validate the GDD in
Figure 23: System prompt for refining GDD. The Designer agent refines the GDD by incorporating feedback provided by the Analyst agent.
Figure 24: System prompt for generating rule. In Stage 3, the Developer agent generates Lua code to implement the rules specified in the GDD.
Figure 25: System prompt for generating test code for rule.
Figure 26: System prompt for validating rule. The Analyst agent evaluates the Lua rule implementation and its corresponding test code by reviewing both the code and the simulation results produced from the tests.
Figure 27: Rubrics for Rule. The rubric used by the Analyst agent for rule validation in
Figure 28: System prompt for refining rule. The Developer agent refines the Lua rule implementation based on feedback provided by the Analyst agent in
Figure 29: System prompt for selecting map. In Stage 4, the Developer agent selects an appropriate map for the game based on the GDD and the implemented rules.
Figure 30: System prompt for placing units. The Developer agent determines the placement of units on the selected map based on the specifications defined in the GDD and the implemented rules.
Figure 31: System prompt for defining rule configuration.
Figure 32: System prompt for defining end condition.
Figure 33: System prompt for visual evaluation. The Analyst agent analyzes the final game by running simulations and evaluating the resulting visualizations using the rubric.
Figure 34: Rubrics for Final Script. The rubric used by the Analyst agent for the final game evaluation in
Figure 35: System prompt for refining final script.
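
The captions for the RTSGameAgent inference loop and its memory and decision phases describe a two-phase step: a memory phase that merges short-term event logs with long-term memory and returns relevant entries, then a decision phase that passes those entries, game knowledge, and the current observation to the VLM policy. The sketch below shows only that data flow. Both `stub_consolidate` and `stub_policy` are placeholders for model calls, and the dictionary keys are invented names for the four action categories listed in the decision-phase output caption.

```python
from typing import Callable, Dict, List, Tuple

Memory = List[str]


def stub_consolidate(short_term: List[str], long_term: Memory, k: int = 3) -> Tuple[Memory, Memory]:
    """Stand-in for the LLM memory phase: returns (relevant entries, updated memory)."""
    updated = long_term + [e for e in short_term if e not in long_term]
    return updated[-k:], updated


def stub_policy(relevant: Memory, knowledge: str, observation: dict) -> Dict[str, list]:
    """Stand-in for the VLM policy: an empty plan over the four action categories."""
    return {"build": [], "produce": [], "assign_groups": [], "move_groups": []}


def run_episode(
    observations: List[dict],
    knowledge: str,
    consolidate: Callable[[List[str], Memory], Tuple[Memory, Memory]],
    policy: Callable[[Memory, str, dict], Dict[str, list]],
) -> Tuple[Memory, List[Dict[str, list]]]:
    long_term: Memory = []
    plans = []
    for obs in observations:
        # Memory phase: S_t and L_{t-1} in, m_t and L_t out.
        relevant, long_term = consolidate(obs["events"], long_term)
        # Decision phase: m_t, K, and o_t in, action plan out.
        plans.append(policy(relevant, knowledge, obs))
    return long_term, plans


# Hypothetical observations; image fields are omitted for brevity.
observations = [
    {"events": ["enemy scout seen north"]},
    {"events": ["enemy scout seen north", "lost 2 units at ridge"]},
]
final_memory, plans = run_episode(observations, "unit descriptions", stub_consolidate, stub_policy)
print(final_memory)
```

Keeping the two phases behind separate callables makes it cheap to swap either one for a rule-based component when testing where failures originate.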
Original abstract

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent that manages units by an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination, multiagent coordination and when task scale increases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RTSGameBench, a benchmark built on the Beyond All Reason RTS game for evaluating vision-language models on strategic reasoning tasks. It features diverse matchup structures, diagnostic mini-games targeting individual competencies, a self-evolving framework to generate new scenarios from free-form queries, and the RTSGameAgent (an FSM augmented with agentic memory) to enable VLMs to control units in large-scale settings. The central empirical claim is that multiple state-of-the-art VLMs perform poorly in matchups requiring tighter coordination, multi-agent coordination, and at increased task scales.

Significance. If the evaluation methodology properly isolates VLM reasoning limitations, the benchmark could provide a scalable, extensible testbed for diagnosing strategic deficiencies in VLMs that existing fixed RTS benchmarks do not address. The self-evolving generation and competency-targeted mini-games represent potentially useful contributions to benchmark design.

major comments (3)
  1. [RTSGameAgent and empirical validation sections] Evaluation methodology (RTSGameAgent description and VLM results sections): All VLM performance measurements route through RTSGameAgent, which decomposes control via an FSM with agentic memory. No ablation is reported that holds the agent fixed while varying the VLM, substitutes a non-VLM controller, or compares against a pure VLM baseline without the FSM layer. This prevents isolating whether observed shortfalls in coordination and scale arise from VLM strategic reasoning or from limitations in the FSM and memory interface, directly undermining the central claim that VLMs 'do not perform well' due to reasoning deficiencies.
  2. [Mini-games and self-evolving generation sections] Mini-games and diagnostic assessment section: The claim that each mini-game targets an individual strategic competency (and that the self-evolving framework improves coverage) lacks supporting evidence such as inter-rater validation, correlation analysis with full-game performance, or ablation of the generation cycles. Without this, the diagnostic value of the benchmark for specific competencies cannot be assessed.
  3. [Empirical results and tables/figures] Results presentation: The abstract states an empirical validation of poor VLM performance under coordination and scale demands, yet the provided manuscript text supplies no quantitative metrics, error bars, baseline comparisons (e.g., rule-based agents or human play), or scoring details. If these are absent from the full results tables/figures, the central empirical claim lacks visible support.
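
On the error-bar point in the third major comment: with a small number of games per matchup, a raw win rate says little on its own. One common way to attach an interval to a win rate is the Wilson score interval, sketched below. The 3-wins-in-10-games input is made up for illustration and is not a result from the paper, and nothing here implies the authors use this method.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and games")
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half


# Hypothetical record: 3 wins in 10 games.
low, high = wilson_interval(3, 10)
print(f"win rate 0.30, interval ({low:.3f}, {high:.3f})")
```

An interval this wide at ten games shows why a "VLM A beats VLM B" claim needs either many more games or an explicit significance test.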
minor comments (2)
  1. [RTSGameAgent implementation] Clarify the exact interface between VLM outputs and RTSGameAgent actions (e.g., how natural language plans are mapped to FSM states) to improve reproducibility.
  2. [Related work and benchmark design] Add explicit discussion of partial observability handling and how the benchmark differs quantitatively from prior RTS testbeds (e.g., in battlefield size or agent count).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on RTSGameBench. We address each major comment point by point below, providing clarifications on our methodology and indicating planned revisions where they strengthen the claims without misrepresenting the work.

Point-by-point responses
  1. Referee: [RTSGameAgent and empirical validation sections] Evaluation methodology (RTSGameAgent description and VLM results sections): All VLM performance measurements route through RTSGameAgent, which decomposes control via an FSM with agentic memory. No ablation is reported that holds the agent fixed while varying the VLM, substitutes a non-VLM controller, or compares against a pure VLM baseline without the FSM layer. This prevents isolating whether observed shortfalls in coordination and scale arise from VLM strategic reasoning or from limitations in the FSM and memory interface, directly undermining the central claim that VLMs 'do not perform well' due to reasoning deficiencies.

    Authors: RTSGameAgent is presented as an enabling component specifically to allow VLMs to operate in large-scale RTS environments, where direct end-to-end control of hundreds of units under partial observability is not feasible. The reported results therefore reflect VLM strategic reasoning as mediated through this practical interface, which we view as the relevant setting for the benchmark. We agree, however, that the absence of ablations (e.g., rule-based controllers using the identical FSM/memory layer or a pure-VLM baseline) limits the ability to fully attribute shortfalls to the VLM. In the revised version we will add these comparisons and a dedicated discussion of the agent's role in isolating reasoning limitations. revision: yes

  2. Referee: [Mini-games and self-evolving generation sections] Mini-games and diagnostic assessment section: The claim that each mini-game targets an individual strategic competency (and that the self-evolving framework improves coverage) lacks supporting evidence such as inter-rater validation, correlation analysis with full-game performance, or ablation of the generation cycles. Without this, the diagnostic value of the benchmark for specific competencies cannot be assessed.

    Authors: Mini-games were constructed by mapping established RTS competencies (resource allocation, tactical positioning, multi-unit coordination, etc.) to isolated scenarios derived from Beyond All Reason mechanics. The self-evolving generator is intended to iteratively expand coverage from free-form queries. We acknowledge that the manuscript currently provides only design rationale rather than quantitative validation such as inter-rater agreement or performance correlations. We will add correlation analyses between mini-game and full-matchup results, plus a description of the generation-cycle ablation, in the revision. revision: yes

  3. Referee: [Empirical results and tables/figures] Results presentation: The abstract states an empirical validation of poor VLM performance under coordination and scale demands, yet the provided manuscript text supplies no quantitative metrics, error bars, baseline comparisons (e.g., rule-based agents or human play), or scoring details. If these are absent from the full results tables/figures, the central empirical claim lacks visible support.

    Authors: Section 5 of the full manuscript contains the quantitative results, including win-rate tables across VLMs and matchups, scale-sensitivity plots, coordination metrics, and rule-based baselines. Error bars and scoring definitions are provided in the corresponding figures and appendix. If these elements were not immediately apparent from the text, we will add explicit cross-references from the abstract and results narrative, and ensure all numerical values are highlighted in the revision. revision: partial
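
The second response promises correlation analyses between mini-game and full-matchup results. One simple form such an analysis could take is a Spearman rank correlation across models, sketched here in plain Python. The score lists are invented placeholders, not numbers from the paper, and the authors may well choose a different statistic.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-model scores: one mini-game score and one full-game win rate each.
mini_game_scores = [0.2, 0.5, 0.7, 0.9]
full_game_win_rates = [0.1, 0.3, 0.2, 0.6]
rho = spearman(mini_game_scores, full_game_win_rates)
print(round(rho, 3))
```

A high correlation for a given mini-game would suggest it tracks something that matters in full games, but it would not by itself show that the mini-game isolates a single competency, which is the referee's sharper concern.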

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no derivations or self-referential reductions

full rationale

The paper constructs and evaluates an RTS benchmark for VLMs using mini-games, a self-evolving generation process, and an FSM-based agent wrapper. No equations, parameter fits, predictions, or uniqueness theorems appear in the provided text. The central claims are empirical performance measurements on held-out matchups and mini-games; these do not reduce to the benchmark's own inputs by construction. Self-citations are absent from the load-bearing steps. The work is therefore self-contained as an empirical artifact rather than a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The benchmark rests on the domain assumption that RTS games isolate strategic competencies and that mini-games can be designed to target them individually; the RTSGameAgent is an invented component required for VLM interaction.

axioms (1)
  • domain assumption: Real-time strategy games demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability, making them a natural testbed for strategic reasoning.
    This premise justifies the choice of RTS as the evaluation domain and is stated directly in the abstract.
invented entities (1)
  • RTSGameAgent (no independent evidence)
    purpose: Manages units via an FSM with agentic memory so VLMs can operate in large-scale RTS games.
    New component introduced in the abstract to bridge VLMs to the game environment.

pith-pipeline@v0.9.1-grok · 5771 in / 1262 out tokens · 22494 ms · 2026-06-26T20:59:49.009114+00:00 · methodology


Reference graph

Works this paper leans on

161 extracted references · 1 canonical work pages

  1. [1]

    In: Conference on Language Modeling (COLM) (2025)

    Ahn, D., Kim, S., Choi, J.: Society of mind meets real-time strategy: A hierarchical multi-agent framework for strategic reasoning. In: Conference on Language Modeling (COLM) (2025)

  2. [2]

    IEEE Transactions on Games (2025)

    Anne, T., Syrkis, N., Elhosni, M., Turati, F., Legendre, F., Jaquier, A., Risi, S.: Harnessing language for coordination: A framework and benchmark for llm-driven multi-agent control. IEEE Transactions on Games (2025)

  3. [3]

    In: Psychology of Learning and Motivation, vol

    Atkinson, R.C., Shiffrin, R.M.: Human memory: A proposed system and its control processes. In: Psychology of Learning and Motivation, vol. 2, pp. 89–195. Academic Press (1968)

  4. [4]

    Beyond All Reason Developers: Beyond all reason: Main game repository.https: //github.com/beyond-all-reason/Beyond-All-Reason(2019)

  5. [5]

    https://github.com/beyond-all-reason/RecoilEngine (2023), a hard fork of the SpringRTS engine (version 105 tree)

    Beyond All Reason Developers: Recoil engine: A powerful free cross-platform RTS game engine. https://github.com/beyond-all-reason/RecoilEngine (2023), a hard fork of the SpringRTS engine (version 105 tree)

  6. [6]

    https://www.beyondallreason

    Beyond All Reason Team: Beyond all reason. https://www.beyondallreason. info/(2024), open-source real-time strategy game

  7. [7]

    In: Robotics: Science and Systems (2023)

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrish- nan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N.J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K.H., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J...

  8. [8]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33, pp. 1877–1901 (2020)

  9. [9]

    Jones & Bartlett Learning (2004)

    Buckland, M.: Programming Game AI by Example. Jones & Bartlett Learning (2004)

  10. [10]

    In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI)

    Buro, M.: Real-time strategy games: A new ai research challenge. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI). pp. 1534–1535 (2003) 16 S. Kim et al

  11. [11]

    In: International Conference on Machine Learning (ICML)

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. In: International Conference on Machine Learning (ICML). pp. 8469–8488 (2023)

  12. [12]

    In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023),https://openreview

    Ellis, B., Cook, J., Moalla, S., Samvelyan, M., Sun, M., Mahajan, A., Foerster, J.N., Whiteson, S.: SMACv2: An improved benchmark for cooperative multi- agent reinforcement learning. In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023),https://openreview. net/forum?id=5OjLGiJW3u

  13. [13]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.A., Zhu, Y., Anandkumar, A.: Minedojo: Building open-ended embodied agents with internet-scale knowledge. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 18343–18362 (2022)

  14. [14]

    com/wiki/StarCraft_II(2024), accessed: 2025

    Fandom Contributors: StarCraft II — StarCraft wiki.https://starcraft.fandom. com/wiki/StarCraft_II(2024), accessed: 2025

  15. [15]

    CRC Press, Boca Raton, FL, 3rd edn

    Fullerton, T.: Game Design Workshop: A Playcentric Approach to Creating Inno- vative Games. CRC Press, Boca Raton, FL, 3rd edn. (2014)

  16. [16]

    arXiv preprint arXiv:2305.19165 (2023)

    Gandhi, K., Sadigh, D., Goodman, N.D.: Strategic reasoning with language models. arXiv preprint arXiv:2305.19165 (2023)

  17. [17]

    In: NeurIPS (2024)

    Guan, Z., Kong, X., Zhong, F., Wang, Y.: Richelieu: Self-evolving llm-based agents for ai diplomacy. In: NeurIPS (2024)

  18. [18]

    arXiv preprint arXiv:2308.00352 (2024)

    Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, Z., Yau, S.K.S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., Schmidhuber, J.: Metagpt: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 (2024)

  19. [19]

    Hu, L., Huo, M., Zhang, Y., Yu, H., Xing, E.P., Stoica, I., Rosing, T., Jin, H., Zhang, H.: lmgame-bench: How good are llms at playing games? arXiv preprint arXiv:2505.15146 (2025)

  20. [20]

    arXiv preprint arXiv:2402.01118 (2024)

    Hu, S., Huang, T., Liu, L.: Pokéllmon: A human-parity agent for pokémon battles with large language models. arXiv preprint arXiv:2402.01118 (2024)

  21. [21]

    Lua.org (2006)

    Ierusalimschy, R.: Programming in Lua. Lua.org (2006)

  22. [22]

    Jwa, S., Ahn, D., Kim, R., Kang, D., Choi, J.: Becoming experienced judges: Selective test-time learning for evaluators (2025)

  23. [23]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Khan, M.H., Sarvadevabhatla, R.K.: Sketchtopia: A dataset and foundational agents for benchmarking asynchronous multimodal communication with iconic feedback. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18176–18186 (2025)

  24. [24]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum? id=Xr73jEYG29

    Li, Z., Ni, Y., Qi, R., Lu, C., Jiang, L., Xiaojie, X., Liu, X., Li, P., Guo, Y., Ma, Z., Li, H., wu hui, Xian, G., Huang, K., Zhang, X.: LLM-pySC2: Starcraft II learning environment for large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum? id=Xr73jEYG29

  25. [25]

    Light, J., Cai, M., Shen, S., Hu, Z.: Avalonbench: Evaluating llms playing the game of avalon. arXiv preprint arXiv:2310.05036 (2023)

  26. [26]

    Lin, W., Roberts, J., Yang, Y., Albanie, S., Lu, Z., Han, K.: GAMEBoT: Transparent assessment of LLM reasoning in games. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7656–. Association for Computational Linguistics, Vienna, Austria (2025). https://doi.org/10.18653/v1/2025.acl-long.378, https://aclanthology.org/2025.acl-long.378/

  28. [28]

    Long, Q., Li, Z., Gong, R., Wu, Y.N., Terzopoulos, D., Gao, X.: Teamcraft: A benchmark for multi-modal multi-agent systems in minecraft. arXiv preprint arXiv:2412.05255 (2024)

  29. [29]

    Ma, W., Fu, Y., Zhang, Z., Ghanem, B., Li, G.: Ava: Attentive vlm agent for mastering starcraft ii. arXiv preprint arXiv:2503.05383 (2025)

  30. [30]

    Ma, W., Mi, Q., Zeng, Y., Yan, X., Lin, R., Wu, Y., Wang, J., Zhang, H.: Large language models play starcraft II: benchmarks and a chain of summarization approach. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=kEPpD7yETM

  31. [31]

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al.: Self-refine: Iterative refinement with self-feedback. In: NeurIPS (2023)

  32. [32]

    Ontañón, S., Synnaeve, G., Uriarte, A., Richoux, F., Churchill, D., Preuss, M.: A survey of real-time strategy game ai research and competition in starcraft. IEEE Transactions on Computational Intelligence and AI in Games 5(4), 293–311 (2013)

  33. [33]

    OpenAI: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  34. [34]

    OpenAI: Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/ (December 2025), accessed: 2025-12-11

  35. [35]

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 27730–27744 (2022)

  36. [36]

    Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S.G., Stoica, I., Gonzalez, J.E.: MemGPT: Towards llms as operating systems. arXiv preprint arXiv:2310.08560 (2023)

  37. [37]

    Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pignatelli, E., Kuciński, Ł., Pinto, L., Fergus, R., et al.: Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543 (2024)

  38. [38]

    Park, D., Kim, M., Choi, B., Kim, J., Lee, K., Lee, J., Park, I., Lee, B.U., Hwang, J., Ahn, J., et al.: Orak: A foundational benchmark for training and evaluating llm agents on diverse video games. arXiv preprint arXiv:2506.03610 (2025)

  39. [39]

    Park, J.S., O’Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Generative agents: Interactive simulacra of human behavior. In: ACM Symposium on User Interface Software and Technology (UIST). pp. 1–22 (2023)

  40. [40]

    Qi, S., Chen, S., Li, Y., Kong, X., Wang, J., Yang, B., Wong, P., Zhong, Y., Zhang, X., Zhang, Z., et al.: Civrealm: A learning and reasoning odyssey in civilization for decision-making agents. arXiv preprint arXiv:2401.10568 (2024)

  41. [41]

    Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z., Sun, M.: Chatdev: Communicative agents for software development. arXiv preprint arXiv:2307.07924 (2024)

  42. [42]

    Qualitipedia contributors: Beyond all reason. https://newqualitipedia.telepedia.net/wiki/Beyond_All_Reason (2024)

  43. [43]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020)

  44. [44]

    Robertson, G., Watson, I.: A review of real-time strategy game ai. AI Magazine 35(4), 75–104 (2014)

  45. [45]

    Shen, P., Wang, Y., Mu, N., Luan, Y., Xie, R., Yang, S., Wang, L., Hu, H., Xu, S., Yang, Y., et al.: Sc2arena and starevolve: Benchmark and self-improvement framework for llms in complex decision-making tasks. arXiv preprint arXiv:2508.10428 (2025)

  46. [46]

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. In: NeurIPS (2023)

  47. [47]

    Squire, L.R.: Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychological Review 99(2), 195–231 (1992)

  48. [48]

    Tan, S., Ivanovic, B., Weng, X., Pavone, M., Kraehenbuehl, P.: Language conditioned traffic generation. arXiv preprint arXiv:2307.07947 (2023)

  49. [49]

    Tang, W., Zhou, Y., Xu, E., Cheng, K., Li, M., Xiao, L.: Dsgbench: A diverse strategic game benchmark for evaluating llm-based agents in complex decision-making environments. arXiv preprint arXiv:2503.06047 (2025)

  50. [50]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  51. [51]

    Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)

  52. [52]

    Wang, D., Zhou, C., Zhao, D., Liu, X., Ma, M.C., Ushaw, G., Davison, R.: Towermind: A tower defence game learning environment and benchmark for llm as agents. arXiv preprint arXiv:2601.05899 (2026)

  53. [53]

    Wang, S., Long, Z., Fan, Z., Huang, X., Wei, Z.: Benchmark self-evolving: A multi-agent framework for dynamic LLM evaluation. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 3310–3328. Association for Computational Linguistics, Abu Dhabi, UAE (2025), https://aclanthology.org/2025.coling-main.223/

  54. [54]

    Wang, Y., Liu, S., Fang, J., Meng, Z.: Evoagentx: An automated framework for evolving agentic workflows (2025), https://arxiv.org/abs/2507.03616

  55. [55]

    Wang, Z., Dong, Y., Luo, F., Ruan, M., Cheng, Z., Chen, C., Li, P., Liu, Y.: Escapecraft: A 3d room escape environment for benchmarking complex multimodal reasoning ability. arXiv preprint arXiv:2503.10042 (2025)

  56. [56]

    Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., et al.: Autogen: Enabling next-gen llm applications via multi-agent conversation. In: COLM (2024)

  57. [57]

    Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., Zhang, Y.: A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110 (2025)

  58. [58]

    Xu, Z., Xu, Z., Yi, X., Yuan, H., Chen, X., Wu, Y., Yu, C., Wang, Y.: Vs-bench: Evaluating vlms for strategic reasoning and decision-making in multi-agent environments. arXiv preprint arXiv:2506.02387 (2025)

  59. [59]

    Zhang, J., Xu, C., Li, B.: Chatscene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15459–15469 (2024)

  60. [60]

    Zhang, Q., Zhou, J., Wang, Y., Lyu, F., Ming, Y., Xu, C., Sun, Q., Zheng, K., Kang, P., Liu, X., Ma, C.: Rubricbench: Aligning model-generated rubrics with human standards. arXiv preprint arXiv:2603.01562 (2026)

  61. [61]

    Zhang, Y., Mao, S., Ge, T., Wang, X., de Wynter, A., Xia, Y., Wu, W., Song, T., Lan, M., Wei, F.: Llm as a mastermind: A survey of strategic reasoning with large language models. In: Conference on Language Modeling (COLM) (2024)

  62. [62]

    Zheng, X., Li, L., Yang, Z., Yu, P., Wang, A.J., Yan, R., Yao, Y., Wang, L.: V-mage: A game evaluation framework for assessing vision-centric capabilities in multimodal large language models. arXiv preprint arXiv:2504.06148 (2025)

  63. [63]

    Zheng, X., Lin, H., He, K., Wang, Z., Fu, Q., Fu, H., Zheng, Z., Liang, Y.: Mcu: An evaluation framework for open-ended game agents. In: Forty-second International Conference on Machine Learning (2025)

    Supplementary Material for RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models. San Kim*1, Daechul Ahn*1, Reokyoung Kim1, H...

  64. [64]

    New triggers: Recent game events that may contain valuable information Your job is to: 1. Update the memory by maintaining the FINAL list of experiences that should be retained

  65. [65]

    Retrieve relevant memories that are useful for the current decision-making For memory updates, you may: - Keep existing experiences that are still relevant - Remove outdated or redundant experiences - Add new experiences derived from triggers - Merge similar experiences into one For each experience in update_memor ...

  66. [66]

    Battle outcomes and their causes 3. Strategic decisions and their results 4. Territorial control changes Output format: { "update_memory": [ {"content": "experience sentence", "tag": "category_tag", "time": game_time}, ... ], "retrieve_memory": [ {"content": "experience sentence", "tag": "category_tag", "time": game_ti...

  67. [67]

    Determine the unit production queue for each factory 3. Organize all military units into tactical squads, assign each squad a movement target Description: {{GAME_DESCRIPTION}} Knowledge: {{GAME_KNOWLEDGE}} IMAGE INPUT FORMAT: - First image: Minimap view of the entire battlefield - Second image: Screen view (on-screen area, cen...

  68. [68]

    'designer': Creates the Game Design (GDD JSON). Start here

  69. [69]

    'rule_developer': Writes Lua code for Rules defined by the Designer. Run this after Design is valid

  70. [70]

    'script_developer': Generates the final Scenario Script (JSON/Text). Run this ONLY after Rules are valid

  71. [71]

    'finish': All steps are complete and valid. [Current State] - User Intent: {user_intent} - GDD Exists: {has_gdd} - Rules Verified: {rules_valid} - Final Script Exists: {has_final_script} - Last Node Visited: {last_node} - Is Valid: {is_valid} [Feedback History] {feedback_history_summary} [Your Tasks]

  72. [72]

    ANALYZE the feedback history from all agents (designer, rule_developer, script_developer)

  73. [73]

    IDENTIFY which stage failed and why

  74. [74]

    DECIDE the next_node to route to

  75. [75]

    SYNTHESIZE a concise meta_feedback that summarizes: - What went wrong (if anything) - Which agent(s) had issues - Key error patterns or trends - Actionable guidance for the next agent [Routing Rules] - If GDD is missing or has invalid feedback -> Route to 'designer' - If GDD is valid but Rules failed verification -> Route to 'rule_developer' - If Rules ar...

  76. [76]

    **Intent Interpretation**: What did the user actually want? What core gameplay fantasy or experience were they asking for?

  77. [77]

    **Design Rationale**: Why were these specific mechanics, rules, and rules chosen to fulfill that intent? What design principles guided the choices?

  78. [78]

    **Key Decisions**: Explain the most important creative choices — map selection, unit composition, custom rules, win conditions — and why each one serves the user's goal

  79. [79]

    **How It Plays**: Briefly describe what a player would actually experience in this GDD and why that matches what the user asked for. Do NOT include: - Internal system details (agent names, validation loops, feedback rounds) - Technical implementation steps or debugging history - Repetitive restatements of the game description ## INPUTS [User's Original Qu...

  80. [80]

    **IDENTIFY PATTERNS**: Look for the SAME type of failure appearing across 2+ separate feedback entries. A single error does NOT justify any change
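The routing rules quoted earlier in the orchestrator prompt ('designer', 'rule_developer', 'script_developer', 'finish') can be read as a small decision function. The sketch below is only an illustration of that reading, not the authors' implementation. The names `WorkflowState`, `gdd_valid`, `script_valid`, and `route` are hypothetical. The first two branches follow the rules visible in the prompt; the last two are assumptions inferred from the node descriptions, because the prompt text is truncated at that point.

```python
from dataclasses import dataclass


@dataclass
class WorkflowState:
    """Flags loosely mirroring the [Current State] block of the prompt."""
    has_gdd: bool           # "GDD Exists"
    gdd_valid: bool         # hypothetical: GDD has no invalid feedback
    rules_valid: bool       # "Rules Verified"
    has_final_script: bool  # "Final Script Exists"
    script_valid: bool      # hypothetical: final script passed validation


def route(state: WorkflowState) -> str:
    """Return the next node name for the given workflow state."""
    # Stated rule: GDD missing or invalid -> 'designer'.
    if not state.has_gdd or not state.gdd_valid:
        return "designer"
    # Stated rule: GDD valid but Rules failed verification -> 'rule_developer'.
    if not state.rules_valid:
        return "rule_developer"
    # Assumption (source truncated): rules valid, script missing or invalid.
    if not state.has_final_script or not state.script_valid:
        return "script_developer"
    # Assumption: every stage is complete and valid.
    return "finish"


if __name__ == "__main__":
    print(route(WorkflowState(True, True, False, False, False)))
```

Ordering the checks from the earliest stage to the latest means a failure in an upstream stage always takes priority, which matches the "Run this ONLY after Rules are valid" constraint on the script developer.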
