pith. sign in

arxiv: 2606.29932 · v1 · pith:G2DSQWECnew · submitted 2026-06-29 · 💻 cs.AI

SAGA: Scene-Aware, Goal-Evolving Agents for Long-Horizon CivRealm Strategy Planning

Pith reviewed 2026-06-30 06:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentsmulti-agent planningstrategy gamesscene graphslong-horizon reasoningFreeCivgoal evolutioninfrastructure construction
0
0 comments X

The pith

SAGA's three mechanisms let LLM agents reach higher civilization scores in FreeCiv by fixing scene blindness, context overload, and isolated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAGA as an LLM multi-agent system for long-horizon strategy planning in games like FreeCiv, where agents must handle multiple domains under sparse rewards and imperfect information. Existing LLM agents fail at spatial understanding from raw coordinates, suffer context overflow from full state dumps that couple domains, and learn nothing across separate episodes. SAGA counters these with a Map-Semantic Scene Graph that turns spatial relations into per-unit language, a Tool-Augmented Planner that requests domain state on demand and routes directives to specialists, and a Dual-Horizon Feedback Loop that generates goals inside games while running causal post-mortems across games. In FreeCiv evaluations SAGA records the highest mean score with lower variance, leads uniquely on infrastructure, beats the top baselines in most direct matches, and cuts output tokens by 27 percent; adding the cross-game module yields the best chained performance over five episodes.

Core claim

SAGA attains the highest mean civilization score with lower variance than the two strongest baselines, is the only method that significantly surpasses every baseline on infrastructure construction, outscores the two strongest baselines in most head-to-head games while cutting output tokens by 27 percent, and equipped with the cross-game evolution module reaches the highest end-of-chain score across five successive episodes.

What carries the argument

Map-Semantic Scene Graph, Tool-Augmented Planner, and Dual-Horizon Feedback Loop, each addressing one stated failure mode in LLM strategy agents.

If this is right

  • Infrastructure construction improves without trade-offs against other objectives.
  • Output token count drops by 27 percent while head-to-head wins increase.
  • Cross-game causal post-mortems enable rising scores over successive episodes without manual reward design.
  • Each of the three components contributes measurably on its own, per the reported ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The scene-graph approach might transfer to other spatial or imperfect-information settings where raw coordinates cause planning errors.
  • On-demand domain tools could lower the risk of constraint violations in any multi-domain LLM controller.
  • Structured cross-episode post-mortems offer a route to continual adaptation that does not rely on environment-specific rewards.

Load-bearing premise

The three mechanisms each directly and independently fix one failure mode and that FreeCiv metrics plus baselines isolate those fixes without prompt or environment confounds.

What would settle it

An ablation that removes one mechanism yet shows no matching drop in the corresponding metric, or a new baseline that matches SAGA scores without any of the three components.

Figures

Figures reproduced from arXiv: 2606.29932 by Liuyu Xiang, Shuo Chen, Tianyu Jin, Yexin Li, Yida Wang, Yingzhuo Liu, Zhaofeng He, Zhiyao Jiang.

Figure 1
Figure 1. Figure 1: FreeCiv as a complex multi-domain strategy benchmark. (a) CivRealm uniquely combines long credit-assignment horizons with maximal domain coupling. (b) Domain-level breakdown for three representative environments. Dual-Horizon Feedback Loop—that addresses five funda￾mental failure modes in long-horizon multi-domain plan￾ning: scene grounding, context scalability, domain decou￾pling, constraint awareness, an… view at source ↗
Figure 2
Figure 2. Figure 2: SAGA System Architecture. Left: FreeCiv game state at turn t with three decision points: ⃝A FullLLMObsWrapper converts raw game state into structured natural language; ⃝B Map-Semantic SceneGraph G=(V, E) encodes friend￾ly/enemy spatial relations (distance, direction, threat edges) as concise relational context for the planner (§4.2); ⃝C the ReActAgent Planner calls on-demand tools and dispatches a structur… view at source ↗
Figure 3
Figure 3. Figure 3: Cross-game Evolution & Multi-Seed Holistic Validation. (a) Left: All methods are equipped with the same cross-game evolution module. While all methods improve from Game 1 to Game 5, baseline trajectories exhibit pronounced inter-episode oscillation. SAGA (solid red) reaches the highest score at Game 5 (90)—the highest across all method-stage combinations—and leads the field throughout, indicating its archi… view at source ↗
Figure 4
Figure 4. Figure 4: GPT-4o-mini evolution trajectory on Map Seed 2029. Final Score across the five-game evolution chain (G1→G5) on the resource-poor map. SAGA (solid red) dom￾inates both baselines at every step and peaks at G3 (Score 61) before regressing as it lapses back into Despotism. The weak backbone makes gains non-monotonic, yet SAGA never drops into the oscillating low band occupied by HIMA and CoS—structured executi… view at source ↗
read the original abstract

Long-horizon strategic planning in complex strategy games demands concurrent reasoning across multiple decision domains under imperfect information and sparse reward. Existing LLM-based agents suffer from three systematic failures: scene blindness from raw tile coordinates, context overflow and domain coupling from monolithic state dumps, and shallow cross-game learning that treats each episode in isolation. We present SAGA, an LLM multi-agent framework with three mechanisms each directly targeting one class of failure: (i) a Map-Semantic Scene Graph that encodes typed spatial relations among game entities into per-unit natural-language context, resolving spatial blindness without global token inflation; (ii) a Tool-Augmented Planner that pulls fine-grained domain state on demand and dispatches per-domain directives to dedicated specialist controllers, eliminating context overflow, domain coupling, and mechanical constraint violations; and (iii) a Dual-Horizon Feedback Loop that combines periodic within-game goal generation with structured cross-game causal post-mortem, enabling principled strategic evolution without manual reward engineering. Evaluated on FreeCiv, SAGA attains the highest mean civilization score -- the environment's sole sparse objective reward -- with lower variance than the two strongest baselines, and is the only method that significantly surpasses every baseline on infrastructure construction, the resource axis most readily sacrificed under multi-objective conflict. It outscores the two strongest baselines in most head-to-head games while cutting output tokens (the dominant decoding cost) by 27%. Equipped with the cross-game evolution module, SAGA reaches the highest end-of-chain score across five successive episodes. Ablation studies confirm that each architectural component contributes independently to this advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SAGA, an LLM-based multi-agent framework for long-horizon strategy planning in CivRealm (FreeCiv). It identifies three failure modes in existing agents (scene blindness from raw coordinates, context overflow/domain coupling from monolithic states, and shallow cross-game learning) and proposes three targeted mechanisms: (i) a Map-Semantic Scene Graph encoding typed spatial relations into per-unit natural-language context, (ii) a Tool-Augmented Planner that fetches domain-specific state on demand and dispatches to specialist controllers, and (iii) a Dual-Horizon Feedback Loop combining within-game goal generation with cross-game causal post-mortems. On FreeCiv, SAGA reports the highest mean civilization score with lower variance than strong baselines, is the only method to significantly outperform all baselines on infrastructure construction, reduces output tokens by 27%, wins most head-to-head games against top baselines, and achieves the highest end-of-chain score over five successive episodes when using the evolution module. Ablations are stated to confirm independent contributions from each component.

Significance. If the empirical claims hold under rigorous controls, SAGA would provide a concrete, modular architecture for improving LLM agents on sparse-reward, multi-domain, long-horizon tasks. The explicit mapping of mechanisms to failure modes, the use of a standard strategy-game benchmark, and the reported gains in both performance and token efficiency would be useful reference points for the community. The cross-game evolution component is particularly noteworthy as an attempt at principled meta-learning without manual reward design.

major comments (2)
  1. [Experiments / Ablation studies] Experiments / Ablation studies: The central attribution—that each of the three mechanisms independently resolves one failure mode and that the observed gains are not due to overall prompt structure or baseline prompt quality—requires a factorial ablation design or at least matched-prompt controls that vary only the claimed component. The manuscript does not appear to report such controls, leaving open the possibility that the reported superiority on civilization score, infrastructure, variance, and token count arises from differences in prompt engineering rather than the scene-graph or post-mortem modules specifically.
  2. [Evaluation protocol] Evaluation protocol: No details are supplied on the number of independent runs, statistical tests (e.g., significance thresholds for “significantly surpasses every baseline”), random seeds, or exact baseline implementations and prompt templates. Without these, it is impossible to verify that the head-to-head wins, infrastructure gains, and 27% token reduction are reproducible and attributable to the claimed mechanisms rather than environment-specific tuning.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction should explicitly state the total number of episodes, the precise definition of “civilization score,” and the token-budget matching procedure used for baselines.
  2. [Methods] Notation for the scene graph (typed relations, per-unit context construction) and the exact interface of the Tool-Augmented Planner would benefit from a small diagram or pseudocode in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger experimental controls and reproducibility details. We address each major comment below and will revise the manuscript to incorporate additional ablations and protocol information.

read point-by-point responses
  1. Referee: [Experiments / Ablation studies] The central attribution—that each of the three mechanisms independently resolves one failure mode and that the observed gains are not due to overall prompt structure or baseline prompt quality—requires a factorial ablation design or at least matched-prompt controls that vary only the claimed component. The manuscript does not appear to report such controls, leaving open the possibility that the reported superiority on civilization score, infrastructure, variance, and token count arises from differences in prompt engineering rather than the scene-graph or post-mortem modules specifically.

    Authors: We agree that a factorial design or matched-prompt controls would provide stronger isolation of each mechanism's contribution. Our existing ablations remove one component at a time from the full SAGA framework while preserving overall structure, and these show independent gains; however, they do not fully rule out prompt-engineering confounds. In the revision we will add a new set of matched-prompt controls that vary only the targeted component (e.g., scene-graph encoding versus equivalent-length raw-coordinate prompts) and report the resulting performance deltas. revision: yes

  2. Referee: [Evaluation protocol] No details are supplied on the number of independent runs, statistical tests (e.g., significance thresholds for “significantly surpasses every baseline”), random seeds, or exact baseline implementations and prompt templates. Without these, it is impossible to verify that the head-to-head wins, infrastructure gains, and 27% token reduction are reproducible and attributable to the claimed mechanisms rather than environment-specific tuning.

    Authors: We acknowledge the omission of these reproducibility details. The revised manuscript will include: the exact number of independent runs and random seeds used, the statistical tests and significance thresholds applied to claims of outperformance, and the full prompt templates plus baseline implementation details placed in an appendix. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes an empirical LLM multi-agent framework evaluated via head-to-head experiments and ablations on FreeCiv. Claims rest on observed scores, variance reductions, token savings, and component contributions rather than any mathematical derivation, equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing steps reduce to inputs by construction; the work is self-contained against external environment metrics and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The paper's central claims rest on three newly introduced mechanisms whose correctness is not independently verified outside the reported experiments; no free parameters or invented physical entities are described.

axioms (2)
  • domain assumption Natural-language descriptions of typed spatial relations suffice for LLM spatial reasoning in tile-based games
    Invoked by the Map-Semantic Scene Graph component
  • domain assumption Per-domain specialist controllers can be dispatched without introducing new coupling or constraint violations
    Invoked by the Tool-Augmented Planner
invented entities (3)
  • Map-Semantic Scene Graph no independent evidence
    purpose: Encodes typed spatial relations among game entities into per-unit natural-language context
    New component introduced to resolve scene blindness
  • Tool-Augmented Planner no independent evidence
    purpose: Pulls fine-grained domain state on demand and dispatches per-domain directives
    New component introduced to resolve context overflow and domain coupling
  • Dual-Horizon Feedback Loop no independent evidence
    purpose: Combines periodic within-game goal generation with structured cross-game causal post-mortem
    New component introduced to enable strategic evolution

pith-pipeline@v0.9.1-grok · 5843 in / 1664 out tokens · 40488 ms · 2026-06-30T06:19:56.583002+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 2 canonical work pages

  1. [1]

    CivRealm: A learning and reasoning odyssey in Civilization for decision-making agents

    Siyuan Qi et al. CivRealm: A learning and reasoning odyssey in Civilization for decision-making agents. In Proc. ICLR, 2024

  2. [2]

    Large language models play StarCraft II: Benchmarks and a chain of summarization approach

    Weiyu Ma et al. Large language models play StarCraft II: Benchmarks and a chain of summarization approach. InAdvances in NeurIPS, 2024

  3. [3]

    LLMs are not good strate- gists, yet memory-enhanced agency boosts reasoning

    Yi Wu and Zhimin Hu. LLMs are not good strate- gists, yet memory-enhanced agency boosts reasoning. InICLR Workshop on Reasoning and Planning for LLMs, 2025

  4. [4]

    Society of mind meets real-time strategy: A hierarchical multi- agent framework for strategic reasoning.arXiv preprint arXiv:2408.15567, 2025

    Daechul Ahn, San Kim, and Jonghyun Choi. Society of mind meets real-time strategy: A hierarchical multi- agent framework for strategic reasoning.arXiv preprint arXiv:2408.15567, 2025

  5. [5]

    Optimus-2: Multimodal Minecraft agent with goal-observation-action conditioned policy

    Zaijing Li et al. Optimus-2: Multimodal Minecraft agent with goal-observation-action conditioned policy. InProc. CVPR, 2025

  6. [6]

    EvolveR: Self-evolving LLM agents through an experience-driven lifecycle.arXiv preprint arXiv:2412.04843, 2025

    Rong Wu et al. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle.arXiv preprint arXiv:2412.04843, 2025

  7. [7]

    SE-Agent: Self-evolution trajectory optimization in multi-step reasoning with LLM-based agents

    Yifu Guo et al. SE-Agent: Self-evolution trajectory optimization in multi-step reasoning with LLM-based agents. InAdvances in NeurIPS, 2025

  8. [8]

    SiriuS: Self-improving multi-agent systems via bootstrapped reasoning

    Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, and James Zou. SiriuS: Self-improving multi-agent systems via bootstrapped reasoning. InAdvances in NeurIPS, 2025

  9. [9]

    The Hanabi challenge: A new fron- tier for AI research.Artificial Intelligence, 280:103216, 2020

    Nolan Bard et al. The Hanabi challenge: A new fron- tier for AI research.Artificial Intelligence, 280:103216, 2020

  10. [10]

    Leibo et al

    Joel Z. Leibo et al. Scalable evaluation of multi-agent re- inforcement learning with Melting Pot. InProc. ICML, pp. 6187–6199, 2021

  11. [11]

    Dota 2 with large scale deep re- inforcement learning

    Christopher Berner et al. Dota 2 with large scale deep re- inforcement learning. Technical report, OpenAI, 2019

  12. [12]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning.Nature, 575(7782):350–354, 2019

    Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning.Nature, 575(7782):350–354, 2019

  13. [13]

    MineDojo: Building open-ended embod- ied agents with internet-scale knowledge

    Linxi Fan et al. MineDojo: Building open-ended embod- ied agents with internet-scale knowledge. InAdvances in NeurIPS, 2022

  14. [14]

    No-press Diplomacy: Modeling multi-agent gameplay

    Philip Paquette et al. No-press Diplomacy: Modeling multi-agent gameplay. InAdvances in NeurIPS, 32, 2019

  15. [15]

    WebArena: A realistic web environ- ment for building autonomous agents

    Shuyan Zhou et al. WebArena: A realistic web environ- ment for building autonomous agents. InProc. ICLR, 2024

  16. [16]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environ- ments

    Tianbao Xie et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environ- ments. InAdvances in NeurIPS, 2024

  17. [17]

    GAMA-Bench: Evaluating LLMs’ gaming ability in multi-agent environments

    Jen-tse Huang et al. GAMA-Bench: Evaluating LLMs’ gaming ability in multi-agent environments. InProc. ICLR, 2025

  18. [18]

    Xu et al

    Frank F. Xu et al. TheAgentCompany: Benchmarking LLM agents on consequential real-world tasks. InAd- vances in NeurIPS, 2025

  19. [19]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao et al. ReAct: Synergizing reasoning and acting in language models. InProc. ICLR, 2023

  20. [20]

    V oyager: An open-ended embod- ied agent with large language models

    Guanzhi Wang et al. V oyager: An open-ended embod- ied agent with large language models. InAdvances in NeurIPS, 2023

  21. [21]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong et al. MetaGPT: Meta programming for a multi-agent collaborative framework. InProc. ICLR, 2024

  22. [22]

    CoLLAB: Coordinating collabo- rative LLM-based agents for mathematical reasoning

    Shanshan Gong et al. CoLLAB: Coordinating collabo- rative LLM-based agents for mathematical reasoning. InAdvances in NeurIPS, 2025

  23. [23]

    EmbodiedBench: Comprehensive bench- marking multi-modal large language models for embod- ied decision making

    Rui Yang et al. EmbodiedBench: Comprehensive bench- marking multi-modal large language models for embod- ied decision making. InProc. ICML, 2025

  24. [24]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn et al. Reflexion: Language agents with verbal reinforcement learning. InAdvances in NeurIPS, 2023

  25. [25]

    lost in the middle

    Aman Madaan et al. Self-Refine: Iterative refinement with self-feedback. InAdvances in NeurIPS, 2023. Technical Appendix A Interactive Environment Comparison Table 6 provides a structured comparison of representative interactive decision-making environments across ten criteria relevant to complex strategy research. CivRealm is the only environment satisfy...

  26. [26]

    Diplomatic Events -- check first: wars declared, treaties, alliances. 0.5. Threat Analysis -- lost city = CRITICAL; distinguish combat vs. non-combat units before raising military alert

  27. [27]

    Root Cause -- tax rate, resource allocation, unit utilization

  28. [28]

    URGENT RETRY: <Goal>

    Production & Garrison Verification -- compare city status to goals. Goal status: Completed / In Progress / Failed / Failed (Timeout). Failed goal MUST trigger “URGENT RETRY: <Goal>” in corrective_actions. Output JSON: {thoughts: {summary, corrective_actions}, goal_evaluations: {...}} J.8 Game Analyst (Post-Game Forensic Analyst) Invoked once at game end. ...

  29. [29]

    Garrison & Defense: coverage, Wall-Unit synergy, redundant units

  30. [30]

    Production: mix (Settler/Military/Infrastructure), analysis

  31. [31]

    Worker Tile Improvement: road network, irrigation timing, idle workers

  32. [32]

    Tax Rate: Luxury timing (wasted under Despotism), disorder response

  33. [33]

    Technology Path: full chronological sequence, gov-prerequisite trace

  34. [34]

    Government Transition: Despotism penalty quantification, transition pace

  35. [35]

    Zero-Growth Economy

    Over-Militarization vs. Zero-Growth Economy

  36. [36]

    Recommendations must form a coherent Rapid Development Pipeline: City founding -> precise defense (Walls + military unit) -> core infrastructure

    Expansion Stagnation: pace, city spacing, missed safe-land settlement. Recommendations must form a coherent Rapid Development Pipeline: City founding -> precise defense (Walls + military unit) -> core infrastructure. Output JSON: GameAnalysisResponse {final_state, success,root_causes, recommendations, ...}