SAGA: Scene-Aware, Goal-Evolving Agents for Long-Horizon CivRealm Strategy Planning

Liuyu Xiang; Shuo Chen; Tianyu Jin; Yexin Li; Yida Wang; Yingzhuo Liu; Zhaofeng He; Zhiyao Jiang

arxiv: 2606.29932 · v1 · pith:G2DSQWECnew · submitted 2026-06-29 · 💻 cs.AI


Pith reviewed 2026-06-30 06:19 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · multi-agent planning · strategy games · scene graphs · long-horizon reasoning · FreeCiv · goal evolution · infrastructure construction


The pith

SAGA's three mechanisms let LLM agents reach higher civilization scores in FreeCiv by fixing scene blindness, context overload, and isolated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAGA as an LLM multi-agent system for long-horizon strategy planning in games like FreeCiv, where agents must handle multiple domains under sparse rewards and imperfect information. Existing LLM agents fail at spatial understanding from raw coordinates, suffer context overflow from full state dumps that couple domains, and learn nothing across separate episodes. SAGA counters these with a Map-Semantic Scene Graph that turns spatial relations into per-unit language, a Tool-Augmented Planner that requests domain state on demand and routes directives to specialists, and a Dual-Horizon Feedback Loop that generates goals inside games while running causal post-mortems across games. In FreeCiv evaluations SAGA records the highest mean score with lower variance, leads uniquely on infrastructure, beats the top baselines in most direct matches, and cuts output tokens by 27 percent; adding the cross-game module yields the best chained performance over five episodes.
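
To make "turns spatial relations into per-unit language" concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the `Entity` type, the `describe` function, the Chebyshev distance metric, the five-tile radius, and the ally/threat labels are all assumptions made for this example, and map wrap-around is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    """Hypothetical map entity; field names are assumptions for this sketch."""
    name: str
    x: int
    y: int
    friendly: bool


def direction(dx: int, dy: int) -> str:
    """Compass direction of an offset, assuming y grows southward."""
    ns = "north" if dy < 0 else "south" if dy > 0 else ""
    ew = "west" if dx < 0 else "east" if dx > 0 else ""
    return (ns + ew) or "here"


def describe(unit: Entity, others: List[Entity], radius: int = 5) -> List[str]:
    """Return one short sentence per entity within `radius` tiles of `unit`."""
    lines = []
    for other in others:
        if other == unit:
            continue
        dx, dy = other.x - unit.x, other.y - unit.y
        dist = max(abs(dx), abs(dy))  # Chebyshev distance (assumed metric)
        if dist > radius:
            continue  # distant entities never enter this unit's context
        relation = "ally" if other.friendly else "threat"
        lines.append(
            f"{other.name} ({relation}): distance {dist}, "
            f"{direction(dx, dy)} of {unit.name}"
        )
    return lines
```

Because each unit only receives sentences about its own neighbourhood, the context grows with local density rather than with map size, which is the property the summary above attributes to the scene graph.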

Core claim

SAGA attains the highest mean civilization score with lower variance than the two strongest baselines, is the only method that significantly surpasses every baseline on infrastructure construction, outscores the two strongest baselines in most head-to-head games while cutting output tokens by 27 percent, and, when equipped with the cross-game evolution module, reaches the highest end-of-chain score across five successive episodes.

What carries the argument

Map-Semantic Scene Graph, Tool-Augmented Planner, and Dual-Horizon Feedback Loop, each addressing one stated failure mode in LLM strategy agents.

If this is right

Infrastructure construction improves without trade-offs against other objectives.
Output token count drops by 27 percent while head-to-head wins increase.
Cross-game causal post-mortems enable rising scores over successive episodes without manual reward design.
Each of the three components contributes measurably on its own, per the reported ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scene-graph approach might transfer to other spatial or imperfect-information settings where raw coordinates cause planning errors.
On-demand domain tools could lower the risk of constraint violations in any multi-domain LLM controller.
Structured cross-episode post-mortems offer a route to continual adaptation that does not rely on environment-specific rewards.

Load-bearing premise

That the three mechanisms each directly and independently fix one failure mode, and that the FreeCiv metrics and baselines isolate those fixes without prompt or environment confounds.

What would settle it

An ablation that removes one mechanism yet shows no matching drop in the corresponding metric, or a new baseline that matches SAGA scores without any of the three components.

Figures

Figures reproduced from arXiv: 2606.29932 by Liuyu Xiang, Shuo Chen, Tianyu Jin, Yexin Li, Yida Wang, Yingzhuo Liu, Zhaofeng He, Zhiyao Jiang.

**Figure 1.** FreeCiv as a complex multi-domain strategy benchmark. (a) CivRealm uniquely combines long credit-assignment horizons with maximal domain coupling. (b) Domain-level breakdown for three representative environments. … Dual-Horizon Feedback Loop, that addresses five fundamental failure modes in long-horizon multi-domain planning: scene grounding, context scalability, domain decoupling, constraint awareness, an…

**Figure 2.** SAGA System Architecture. Left: FreeCiv game state at turn t with three decision points: (A) FullLLMObsWrapper converts raw game state into structured natural language; (B) Map-Semantic SceneGraph G=(V, E) encodes friendly/enemy spatial relations (distance, direction, threat edges) as concise relational context for the planner (§4.2); (C) the ReActAgent Planner calls on-demand tools and dispatches a structur…

**Figure 3.** Cross-game Evolution & Multi-Seed Holistic Validation. (a) Left: All methods are equipped with the same cross-game evolution module. While all methods improve from Game 1 to Game 5, baseline trajectories exhibit pronounced inter-episode oscillation. SAGA (solid red) reaches the highest score at Game 5 (90), the highest across all method-stage combinations, and leads the field throughout, indicating its archi…

**Figure 4.** GPT-4o-mini evolution trajectory on Map Seed 2029. Final Score across the five-game evolution chain (G1→G5) on the resource-poor map. SAGA (solid red) dominates both baselines at every step and peaks at G3 (Score 61) before regressing as it lapses back into Despotism. The weak backbone makes gains non-monotonic, yet SAGA never drops into the oscillating low band occupied by HIMA and CoS: structured executi…

Original abstract

Long-horizon strategic planning in complex strategy games demands concurrent reasoning across multiple decision domains under imperfect information and sparse reward. Existing LLM-based agents suffer from three systematic failures: scene blindness from raw tile coordinates, context overflow and domain coupling from monolithic state dumps, and shallow cross-game learning that treats each episode in isolation. We present SAGA, an LLM multi-agent framework with three mechanisms each directly targeting one class of failure: (i) a Map-Semantic Scene Graph that encodes typed spatial relations among game entities into per-unit natural-language context, resolving spatial blindness without global token inflation; (ii) a Tool-Augmented Planner that pulls fine-grained domain state on demand and dispatches per-domain directives to dedicated specialist controllers, eliminating context overflow, domain coupling, and mechanical constraint violations; and (iii) a Dual-Horizon Feedback Loop that combines periodic within-game goal generation with structured cross-game causal post-mortem, enabling principled strategic evolution without manual reward engineering. Evaluated on FreeCiv, SAGA attains the highest mean civilization score -- the environment's sole sparse objective reward -- with lower variance than the two strongest baselines, and is the only method that significantly surpasses every baseline on infrastructure construction, the resource axis most readily sacrificed under multi-objective conflict. It outscores the two strongest baselines in most head-to-head games while cutting output tokens (the dominant decoding cost) by 27%. Equipped with the cross-game evolution module, SAGA reaches the highest end-of-chain score across five successive episodes. Ablation studies confirm that each architectural component contributes independently to this advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAGA reports solid empirical gains on FreeCiv from a scene-graph plus tool-use plus cross-episode loop setup, but the attribution to those three modules still needs the ablation details to rule out prompt confounds.


The paper's core offering is a three-part LLM agent for FreeCiv that turns raw map data into a typed scene graph, lets the planner call tools for domain-specific state instead of dumping everything, and adds a within-game plus cross-game feedback loop for goal evolution. On the reported numbers this beats the strongest baselines on mean score and infrastructure while using 27% fewer output tokens and showing lower variance.

The specific combination of those three pieces is new in the cited literature. The authors do a clean job naming the three failure modes (spatial blindness, context bloat, isolated episodes) and pairing each with one mechanism, which makes the design easy to follow. The cross-game module reaching higher end-of-chain scores over five episodes is the part that feels most practically useful.

The main soft spot is causal attribution. The abstract says ablations show each component contributes independently, but the stress-test concern is real: without a matched-prompt baseline or a full factorial design that crosses the modules while holding token budget and prompt structure fixed, some of the score and variance gains could come from overall prompt quality rather than the scene graph or post-mortem loop. If the full paper only has leave-one-out ablations without those controls, the claim that each mechanism fixes its target failure mode stays provisional.
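
The difference between leave-one-out and a full factorial design can be sketched in a few lines of Python. With three components a full factorial has 2^3 = 8 cells, whereas leave-one-out ablations cover only four of them (the full system plus three single removals). The component names and both helper functions below are hypothetical, chosen for this illustration; the paper's actual ablation code is not described on this page.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Sequence

# Hypothetical labels for the three SAGA components.
COMPONENTS = ("scene_graph", "tool_planner", "feedback_loop")


def factorial_configs(
    components: Sequence[str] = COMPONENTS,
) -> List[Dict[str, bool]]:
    """Enumerate every on/off combination of the components (2^k cells)."""
    return [
        dict(zip(components, flags))
        for flags in product((False, True), repeat=len(components))
    ]


def main_effect(scores: Dict[FrozenSet[str], float], component: str) -> float:
    """Mean score with `component` enabled minus mean score with it disabled.

    `scores` maps the set of enabled components to the mean score measured
    for that cell; it should contain every cell of the factorial grid.
    """
    on = [s for enabled, s in scores.items() if component in enabled]
    off = [s for enabled, s in scores.items() if component not in enabled]
    return sum(on) / len(on) - sum(off) / len(off)
```

Running every cell under a fixed prompt template and token budget is what would let a main effect be read as the contribution of one mechanism rather than of prompt quality; interaction terms can be estimated from the same grid.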

This is for people working on LLM agents for long-horizon sparse-reward games or planning. It is worth sending to peer review because the architecture is concrete, the benchmark is standard, and the reported metrics are falsifiable even if the current evidence for the mechanisms is not yet airtight.

Referee Report

2 major / 2 minor

Summary. The paper introduces SAGA, an LLM-based multi-agent framework for long-horizon strategy planning in CivRealm (FreeCiv). It identifies three failure modes in existing agents (scene blindness from raw coordinates, context overflow/domain coupling from monolithic states, and shallow cross-game learning) and proposes three targeted mechanisms: (i) a Map-Semantic Scene Graph encoding typed spatial relations into per-unit natural-language context, (ii) a Tool-Augmented Planner that fetches domain-specific state on demand and dispatches to specialist controllers, and (iii) a Dual-Horizon Feedback Loop combining within-game goal generation with cross-game causal post-mortems. On FreeCiv, SAGA reports the highest mean civilization score with lower variance than strong baselines, is the only method to significantly outperform all baselines on infrastructure construction, reduces output tokens by 27%, wins most head-to-head games against top baselines, and achieves the highest end-of-chain score over five successive episodes when using the evolution module. Ablations are stated to confirm independent contributions from each component.

Significance. If the empirical claims hold under rigorous controls, SAGA would provide a concrete, modular architecture for improving LLM agents on sparse-reward, multi-domain, long-horizon tasks. The explicit mapping of mechanisms to failure modes, the use of a standard strategy-game benchmark, and the reported gains in both performance and token efficiency would be useful reference points for the community. The cross-game evolution component is particularly noteworthy as an attempt at principled meta-learning without manual reward design.

major comments (2)

[Experiments / Ablation studies] The central attribution (that each of the three mechanisms independently resolves one failure mode and that the observed gains are not due to overall prompt structure or baseline prompt quality) requires a factorial ablation design or at least matched-prompt controls that vary only the claimed component. The manuscript does not appear to report such controls, leaving open the possibility that the reported superiority on civilization score, infrastructure, variance, and token count arises from differences in prompt engineering rather than the scene-graph or post-mortem modules specifically.
[Evaluation protocol] No details are supplied on the number of independent runs, statistical tests (e.g., significance thresholds for “significantly surpasses every baseline”), random seeds, or exact baseline implementations and prompt templates. Without these, it is impossible to verify that the head-to-head wins, infrastructure gains, and 27% token reduction are reproducible and attributable to the claimed mechanisms rather than environment-specific tuning.

minor comments (2)

[Abstract / Introduction] The abstract and introduction should explicitly state the total number of episodes, the precise definition of “civilization score,” and the token-budget matching procedure used for baselines.
[Methods] Notation for the scene graph (typed relations, per-unit context construction) and the exact interface of the Tool-Augmented Planner would benefit from a small diagram or pseudocode in the methods section.
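
As a rough picture of the kind of interface the second minor comment asks the authors to document, the following Python sketch shows on-demand domain queries and per-domain dispatch. Every name here (`ToolRegistry`, `build_context`, `dispatch`, the domain keys) is an assumption made for illustration and is not taken from the paper.

```python
from typing import Callable, Dict, List

# A domain tool maps the raw game state to a short text view of one domain.
ToolFn = Callable[[dict], str]


class ToolRegistry:
    """Hypothetical registry of per-domain state tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, domain: str, fn: ToolFn) -> None:
        self._tools[domain] = fn

    def query(self, domain: str, state: dict) -> str:
        if domain not in self._tools:
            raise KeyError(f"no tool registered for domain {domain!r}")
        return self._tools[domain](state)


def build_context(registry: ToolRegistry, state: dict, requested: List[str]) -> str:
    """Assemble planner context from only the domains that were requested."""
    return "\n".join(f"[{d}] {registry.query(d, state)}" for d in requested)


def dispatch(
    directives: Dict[str, str],
    controllers: Dict[str, Callable[[str], str]],
) -> Dict[str, str]:
    """Route each per-domain directive to its specialist controller.

    Directives for domains without a controller are skipped.
    """
    return {d: controllers[d](text) for d, text in directives.items() if d in controllers}
```

The point of the design is that unrequested domains contribute no tokens to the planner's prompt, in contrast to a monolithic state dump.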

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger experimental controls and reproducibility details. We address each major comment below and will revise the manuscript to incorporate additional ablations and protocol information.


Referee: [Experiments / Ablation studies] The central attribution—that each of the three mechanisms independently resolves one failure mode and that the observed gains are not due to overall prompt structure or baseline prompt quality—requires a factorial ablation design or at least matched-prompt controls that vary only the claimed component. The manuscript does not appear to report such controls, leaving open the possibility that the reported superiority on civilization score, infrastructure, variance, and token count arises from differences in prompt engineering rather than the scene-graph or post-mortem modules specifically.

Authors: We agree that a factorial design or matched-prompt controls would provide stronger isolation of each mechanism's contribution. Our existing ablations remove one component at a time from the full SAGA framework while preserving overall structure, and these show independent gains; however, they do not fully rule out prompt-engineering confounds. In the revision we will add a new set of matched-prompt controls that vary only the targeted component (e.g., scene-graph encoding versus equivalent-length raw-coordinate prompts) and report the resulting performance deltas.

Revision: yes

Referee: [Evaluation protocol] No details are supplied on the number of independent runs, statistical tests (e.g., significance thresholds for “significantly surpasses every baseline”), random seeds, or exact baseline implementations and prompt templates. Without these, it is impossible to verify that the head-to-head wins, infrastructure gains, and 27% token reduction are reproducible and attributable to the claimed mechanisms rather than environment-specific tuning.

Authors: We acknowledge the omission of these reproducibility details. The revised manuscript will include: the exact number of independent runs and random seeds used, the statistical tests and significance thresholds applied to claims of outperformance, and the full prompt templates plus baseline implementation details placed in an appendix.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper proposes an empirical LLM multi-agent framework evaluated via head-to-head experiments and ablations on FreeCiv. Claims rest on observed scores, variance reductions, token savings, and component contributions rather than any mathematical derivation, equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing steps reduce to inputs by construction; the work is self-contained against external environment metrics and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The paper's central claims rest on three newly introduced mechanisms whose correctness is not independently verified outside the reported experiments; no free parameters or invented physical entities are described.

axioms (2)

Domain assumption: Natural-language descriptions of typed spatial relations suffice for LLM spatial reasoning in tile-based games. Invoked by the Map-Semantic Scene Graph component.
Domain assumption: Per-domain specialist controllers can be dispatched without introducing new coupling or constraint violations. Invoked by the Tool-Augmented Planner.

invented entities (3)

Map-Semantic Scene Graph (no independent evidence). Purpose: encodes typed spatial relations among game entities into per-unit natural-language context. New component introduced to resolve scene blindness.
Tool-Augmented Planner (no independent evidence). Purpose: pulls fine-grained domain state on demand and dispatches per-domain directives. New component introduced to resolve context overflow and domain coupling.
Dual-Horizon Feedback Loop (no independent evidence). Purpose: combines periodic within-game goal generation with structured cross-game causal post-mortem. New component introduced to enable strategic evolution.
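
The two horizons of the feedback loop can be pictured as nested loops: goals are regenerated periodically inside a game, while post-mortem lessons persist across games. The Python sketch below is a schematic under stated assumptions. The function `run_chain`, its callback signatures, and the default turn count and goal period are hypothetical; only the five-game chain length comes from the evaluation described above.

```python
from typing import Callable, List


def run_chain(
    play_turn: Callable[[int, List[str], List[str]], None],
    make_goals: Callable[[int, List[str]], List[str]],
    post_mortem: Callable[[int], List[str]],
    games: int = 5,         # chain length used in the reported evaluation
    turns: int = 100,       # assumed episode length
    goal_period: int = 10,  # assumed interval between goal refreshes
) -> List[str]:
    """Chain `games` episodes and return the lessons accumulated at the end."""
    lessons: List[str] = []      # cross-game horizon: survives between games
    for game in range(games):
        goals: List[str] = []    # within-game horizon: reset at each game start
        for turn in range(turns):
            if turn % goal_period == 0:
                # Periodic goal generation, conditioned on earlier lessons.
                goals = make_goals(turn, lessons)
            play_turn(turn, goals, lessons)
        # Post-mortem runs once per game and feeds the next game.
        lessons = lessons + post_mortem(game)
    return lessons
```

In this framing no hand-designed reward is involved: the only signal carried forward is the text produced by the post-mortem callback.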

pith-pipeline@v0.9.1-grok · 5843 in / 1664 out tokens · 40488 ms · 2026-06-30T06:19:56.583002+00:00 · methodology

