SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Dacheng Tao; Jianbiao Mei; Jiangning Zhang; Jinzhuo Liu; Wencan Jiang; Xiaobin Hu; Yong Liu; Yu Yang; Zhucun Xue


SPIKE reuses strategic reasoning across stable game segments and triggers full planning only at event boundaries

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-20 10:28 UTC pith:GCXCCK2U

load-bearing objection: SPIKE delivers practical token and latency cuts on StarDojo via event-triggered dual control and split memory, but the trigger's detection accuracy lacks the numbers needed to fully credit the efficiency gains.

arXiv 2605.18636 v1 · pith:GCXCCK2U · submitted 2026-05-18 · cs.CV

Wencan Jiang, Jiangning Zhang, Jianbiao Mei, Jinzhuo Liu, Yu Yang, Xiaobin Hu, Zhucun Xue, Yong Liu, Dacheng Tao

classification cs.CV

keywords dual controller · event trigger · long-horizon game agents · strategic planning · reactive execution · hierarchical memory · cost-efficient control · StarDojo

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that long-horizon multimodal game agents can stay goal-directed while respecting tight token and latency limits by splitting control into a low-frequency strategic layer and a high-frequency reactive layer. An event monitor watches for visual shifts, progress stalls, repeated failures, or other signals and decides when to pull in the strategic layer for global replanning or recovery. This reuses one strategic proposal across many local steps instead of recomputing at every interaction. A sympathetic reader would care because constant full reasoning wastes budget on stable stretches while pure reactive control drifts and fails to recover. The design therefore reserves expensive deliberation for the moments it is actually needed.
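The control split described above can be sketched as a loop in which one strategic proposal is reused until a monitor fires. This is a minimal illustration, not the paper's implementation: `Plan`, `run_episode`, and the three callables are hypothetical names, and the toy monitor (escalate whenever the scene label changes) stands in for the real Event Trigger.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    """A strategic proposal that the reactive layer reuses across steps."""
    goal: str
    created_at: int


def run_episode(
    observations: List[str],
    should_escalate: Callable[[int, str], bool],
    plan_strategically: Callable[[int, str], Plan],
    act_reactively: Callable[[Plan, str], str],
) -> dict:
    """Run one episode, calling the strategic layer only when the monitor fires."""
    plan: Optional[Plan] = None
    actions = []
    strategic_calls = 0
    for step, obs in enumerate(observations):
        # Plan on the first step; afterwards only when the monitor escalates.
        if plan is None or should_escalate(step, obs):
            plan = plan_strategically(step, obs)
            strategic_calls += 1
        # The cheap reactive layer acts at every step under the current plan.
        actions.append(act_reactively(plan, obs))
    return {"actions": actions, "strategic_calls": strategic_calls}


# Toy usage (hypothetical): the "event" is any change in the scene label.
obs = ["farm", "farm", "farm", "shop", "shop", "farm"]
result = run_episode(
    obs,
    should_escalate=lambda t, o: t > 0 and o != obs[t - 1],
    plan_strategically=lambda t, o: Plan(goal=f"handle {o}", created_at=t),
    act_reactively=lambda p, o: f"{p.goal} @ {o}",
)
print(result["strategic_calls"], "strategic calls for", len(obs), "steps")
```

In this toy run the strategic layer is invoked at steps 0, 3, and 5 only; the remaining steps reuse the plan already in place.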

Core claim

SPIKE is an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank from structured evidence in the State Action Knowledge Graph. This design reuses strategic proposals over multiple reactive steps, supports local override, on

What carries the argument

Event Trigger that monitors visual change, task progress, repeated actions, and failure signals to decide when to escalate from reactive execution to strategic reasoning
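One way to picture such a trigger is as a set of threshold checks over the four named signals. The sketch below is an assumption-laden illustration: the field names, the thresholds in `TriggerConfig`, and the OR-combination of signals are all hypothetical, since this page does not state how the paper combines or tunes them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StepSignals:
    visual_change: float       # assumed frame-difference score in [0, 1]
    steps_since_progress: int  # steps since task progress last advanced
    repeated_actions: int      # length of the current run of identical actions
    failure: bool              # a failure signal was observed this step


@dataclass(frozen=True)
class TriggerConfig:
    # Hypothetical thresholds for illustration; not values from the paper.
    visual_threshold: float = 0.5
    stall_limit: int = 10
    repeat_limit: int = 3


def escalation_reasons(s: StepSignals, cfg: Optional[TriggerConfig] = None) -> List[str]:
    """Return the names of every signal that crosses its threshold."""
    cfg = cfg or TriggerConfig()
    reasons = []
    if s.visual_change >= cfg.visual_threshold:
        reasons.append("visual_change")
    if s.steps_since_progress >= cfg.stall_limit:
        reasons.append("progress_stall")
    if s.repeated_actions >= cfg.repeat_limit:
        reasons.append("repeated_actions")
    if s.failure:
        reasons.append("failure")
    return reasons


def should_escalate(s: StepSignals, cfg: Optional[TriggerConfig] = None) -> bool:
    """Escalate to strategic reasoning if any signal fires (assumed OR rule)."""
    return bool(escalation_reasons(s, cfg))


stable = StepSignals(0.1, 2, 1, False)
looping = StepSignals(0.1, 2, 3, False)
print(should_escalate(stable), escalation_reasons(looping))
```

Returning the reasons rather than a bare boolean makes it possible to log which signal caused each escalation, which is the kind of per-task trigger statistic the referee report asks for.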

Load-bearing premise

The event trigger can reliably detect the right moments for strategic reasoning without excessive false positives or missed escalations that let the reactive controller drift.

What would settle it

Ablating the event trigger on the Lite-100 StarDojo split and measuring whether success-rate and budgeted-success gains over the strongest baselines disappear or reverse.


If this is right

Strategic reasoning is reused over multiple reactive steps rather than recomputed at every interaction.
Local override remains possible when plans become stale or conditions shift.
Expensive reasoning is reserved for moments where extra deliberation adds value.
Token consumption drops by more than half and latency falls by roughly 40 percent while success rates rise on long-horizon tasks.
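A back-of-the-envelope cost model helps show why the escalation rate matters for the last point. It is not taken from the paper: it assumes every step pays a fixed reactive cost, each escalated step additionally pays a fixed strategic cost, and the comparison point is an agent that pays both at every step. The token counts below are made-up placeholders.

```python
def token_reduction(escalation_rate: float, strategic_tokens: float, reactive_tokens: float) -> float:
    """Fractional token saving of selective escalation versus reasoning at every step."""
    always = strategic_tokens + reactive_tokens
    selective = escalation_rate * strategic_tokens + reactive_tokens
    return 1.0 - selective / always


# Placeholder costs: 1800 strategic tokens, 200 reactive tokens per step.
for rate in (1.0, 0.5, 0.25):
    print(f"escalation rate {rate:.2f}: reduction {token_reduction(rate, 1800, 200):.3f}")
```

Under these assumptions the saving depends directly on how often the trigger fires, which is why the review treats trigger behavior as load-bearing for the efficiency claim.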

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same event-triggered split between planning and execution could be tested in other resource-limited sequential decision domains such as robot manipulation under vision noise.
If the trigger thresholds prove stable across games, the architecture suggests a general template for keeping high-level models in reserve rather than in the loop at every timestep.
Separating short-term action memory from structured knowledge graphs may offer a reusable pattern for context management when different control layers need different retrieval styles.
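The two retrieval styles mentioned in the last point can be illustrated with a small sketch. The class names echo the paper's SA-MB and SA-KG, but the data structures, method names, and retrieval rules here are hypothetical and chosen only to show the contrast between a bounded recency window and an accumulating graph of evidence.

```python
from collections import defaultdict, deque


class StateActionMemoryBank:
    """Short-term store: a bounded window of recent (state, action, ok) records."""

    def __init__(self, capacity: int = 50):
        self.records = deque(maxlen=capacity)  # oldest records fall out automatically

    def add(self, state: str, action: str, ok: bool) -> None:
        self.records.append((state, action, ok))

    def recent_successes(self, state: str) -> list:
        """Actions that recently worked in this state, newest first."""
        return [a for s, a, ok in reversed(self.records) if s == state and ok]


class StateActionKnowledgeGraph:
    """Structured store: edges state --action--> next_state with outcome counts."""

    def __init__(self):
        self.edges = defaultdict(lambda: {"success": 0, "failure": 0})

    def add(self, state: str, action: str, next_state: str, ok: bool) -> None:
        self.edges[(state, action, next_state)]["success" if ok else "failure"] += 1

    def failures_from(self, state: str) -> list:
        """Actions that have failed at least once from this state."""
        return sorted({a for (s, a, _), c in self.edges.items() if s == state and c["failure"] > 0})


bank = StateActionMemoryBank(capacity=2)
graph = StateActionKnowledgeGraph()
history = [
    ("field", "water", "field", True),
    ("field", "harvest", "field", False),
    ("field", "plant", "field", True),
]
for state, action, nxt, ok in history:
    bank.add(state, action, ok)
    graph.add(state, action, nxt, ok)

print(bank.recent_successes("field"), graph.failures_from("field"))
```

With a capacity of 2 the bank has already forgotten the earliest record, while the graph retains every edge; a reactive layer would plausibly query the former and a strategic layer the latter.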

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SPIKE delivers practical token and latency cuts on StarDojo via event-triggered dual control and split memory, but the trigger's detection accuracy lacks the numbers needed to fully credit the efficiency gains.

SPIKE shows a way to handle long-horizon tasks in games by running expensive strategic reasoning only when an event trigger says it's needed, while a reactive controller takes over the rest. This setup, paired with two kinds of memory, looks like it cuts token use and latency without hurting performance much.

What stands out as new is the specific mix: a low-frequency strategic controller for planning and recovery, event monitoring for visual changes or failures, and the split between short-term state-action memory and a knowledge graph. Prior work on reasoning versus reactivity is referenced, but this combination for cost control in open-world games seems fresh.

The paper does well on the empirical side. It reports clear gains on the Lite-100 split of StarDojo, with success rate up 5 points and budgeted success up 9 points, plus big drops in tokens and latency. The ablations help by showing that the event trigger, override, and memory each add something to the results.

One soft spot is the event trigger itself. The abstract says ablations support its role in success and recovery, but there are no numbers on how well it detects the right moments or avoids false alarms. That makes it tough to pin the efficiency gains exactly on selective reasoning rather than on the other parts or on tuning. If the full paper has trigger stats like precision or per-task breakdowns, that would strengthen it. Another minor point is the lack of detail in the abstract on how baselines were implemented and whether results are averaged over multiple runs. These are standard for robustness claims.

This work is for people building agents that need to stay efficient over long sequences in games or similar environments. Someone looking for practical ways to manage LLM costs in embodied tasks would find the framework useful. Overall, the central argument holds up enough on the presented evidence, so it deserves a serious referee. I'd send it to peer review with requests for more on the trigger validation and experimental details.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SPIKE, an adaptive dual-controller framework for cost-efficient long-horizon multimodal game agents. A low-frequency Strategic Controller handles global planning, failure analysis, and recovery; a Reactive Controller manages fast local execution under token budgets; an Event Trigger decides escalation based on visual change, task progress, repeated actions, and failure signals; and Hierarchical Memory (SA-MB for short-term reuse plus SA-KG for structured evidence) supports context retrieval. On the Lite-100 split of StarDojo the method reports +5.0 pp success rate (38.5% relative) and +9.3 pp Budgeted SR (75.6% relative) versus strongest baselines, together with 54.9% token and 40.8% latency reductions. Ablations are cited as evidence that event triggering, reactive override, and heterogeneous memory each contribute to performance.
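The absolute and relative gains quoted above jointly imply the baseline levels, since relative gain equals absolute gain divided by the baseline. The short calculation below derives those implied values; they are back-computed from rounded figures and are not numbers reported on this page.

```python
def implied_baseline(gain_pp: float, relative_gain_pct: float) -> float:
    """Baseline (in %) implied by an absolute gain in points and a relative gain in %."""
    return gain_pp / (relative_gain_pct / 100.0)


sr_base = implied_baseline(5.0, 38.5)    # success rate
bsr_base = implied_baseline(9.3, 75.6)   # Budgeted SR

print(f"SR: baseline ~{sr_base:.1f}%, SPIKE ~{sr_base + 5.0:.1f}%")
print(f"Budgeted SR: baseline ~{bsr_base:.1f}%, SPIKE ~{bsr_base + 9.3:.1f}%")
```

This gives an implied baseline success rate of roughly 13.0% (SPIKE roughly 18.0%) and an implied baseline Budgeted SR of roughly 12.3% (SPIKE roughly 21.6%). Absolute levels this low make run-to-run variance a relevant concern.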

Significance. If the empirical gains hold under rigorous controls, the selective-reasoning design could be moderately significant for resource-constrained long-horizon agents, showing that strategic deliberation can be reused across stable segments rather than invoked every step. The hierarchical memory separation is a practical contribution. The work does not offer parameter-free derivations, machine-checked proofs, or falsifiable predictions beyond the reported benchmark numbers.

major comments (2)

Abstract and experimental results: the claimed 5.0 pp SR and 54.9% token reductions are presented without any report of the number of runs, statistical significance tests, variance, exact baseline implementations, or data-split procedures. These omissions are load-bearing for assessing whether the gains are robust or could arise from implementation details or split-specific tuning.
Event Trigger (method description): the efficiency claim rests on the trigger correctly escalating only at true boundaries while avoiding drift or wasted calls. Although ablations are said to show its contribution to success/recovery, no precision, recall, false-positive/negative rates, or per-task trigger statistics are supplied. Without these metrics it is impossible to attribute the token savings specifically to selective reasoning rather than to the reactive override or heterogeneous memory alone.
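The trigger metrics requested in the second comment could be computed along the following lines if event boundaries were annotated. This is a sketch under stated assumptions: the boundary labels, the step tolerance, and the greedy one-to-one matching rule are all hypothetical choices, and the example step indices are invented for illustration.

```python
def trigger_precision_recall(trigger_steps, boundary_steps, tolerance=2):
    """Match each labeled boundary to at most one trigger within +/- tolerance steps."""
    unmatched = sorted(trigger_steps)
    true_positives = 0
    for boundary in sorted(boundary_steps):
        # Take the earliest still-unmatched trigger that falls inside the window.
        match = next((t for t in unmatched if abs(t - boundary) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(trigger_steps) if trigger_steps else 0.0
    recall = true_positives / len(boundary_steps) if boundary_steps else 0.0
    return precision, recall


# Invented example: four escalations, three annotated boundaries.
precision, recall = trigger_precision_recall([3, 10, 11, 25], [4, 12, 40])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four escalations land near a boundary (precision 0.50) and two of the three boundaries are caught (recall about 0.67). Unmatched triggers correspond to wasted strategic calls and unmatched boundaries to missed escalations.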

minor comments (1)

Abstract: the Lite-100 split and StarDojo benchmark should be briefly characterized (task count, horizon length, observation modality) so readers can judge applicability without immediately consulting the full experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments on our work. We address each of the major comments below and have made revisions to the manuscript to improve clarity and provide additional details where requested.


Referee: Abstract and experimental results: the claimed 5.0 pp SR and 54.9% token reductions are presented without any report of the number of runs, statistical significance tests, variance, exact baseline implementations, or data-split procedures. These omissions are load-bearing for assessing whether the gains are robust or could arise from implementation details or split-specific tuning.

Authors: We agree with the referee that additional experimental details are necessary to substantiate the reported improvements. In the revised manuscript, we have added information specifying that results are averaged over 5 independent runs with different random seeds. We report the standard deviations alongside the means for success rate and token consumption. We have performed and included results from paired t-tests, confirming statistical significance (p < 0.05) for the key gains. Furthermore, we have expanded the experimental setup to describe the exact implementations of the baselines and the procedure for the Lite-100 data split. revision: yes
Referee: Event Trigger (method description): the efficiency claim rests on the trigger correctly escalating only at true boundaries while avoiding drift or wasted calls. Although ablations are said to show its contribution to success/recovery, no precision, recall, false-positive/negative rates, or per-task trigger statistics are supplied. Without these metrics it is impossible to attribute the token savings specifically to selective reasoning rather than to the reactive override or heterogeneous memory alone.

Authors: We appreciate this observation. Our ablations already isolate the effect of the Event Trigger by comparing the full system against a variant without it, showing its contribution to both success and token efficiency. To further address the concern, we have included in the revised version per-task trigger statistics, such as the average number of escalations per episode and examples of trigger activations. However, since the benchmark does not provide ground-truth labels for event boundaries, we are unable to compute precision and recall directly; instead, we rely on the ablation results and qualitative analysis to support the attribution of savings to selective reasoning. revision: partial
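The paired test mentioned in the first response can be sketched with the standard library alone. The per-seed scores below are made-up placeholders, not results from the paper, and the sketch assumes the method and the baseline are run on the same five seeds so that differences can be paired. The critical value 2.776 is the two-sided 5% threshold of the t distribution with 4 degrees of freedom.

```python
import math
import statistics


def paired_t(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for per-seed score differences."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation; this is zero (and the ratio undefined) if all diffs are equal.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1


# Placeholder success rates (%) for five seeds; NOT values from the paper.
baseline_sr = [12.0, 13.0, 14.0, 13.0, 13.0]
method_sr = [17.0, 19.0, 18.0, 18.0, 18.0]

t_stat, dof = paired_t(method_sr, baseline_sr)
T_CRIT_DF4 = 2.776  # two-sided alpha = 0.05, 4 degrees of freedom
print(f"t={t_stat:.2f}, df={dof}, significant={abs(t_stat) > T_CRIT_DF4}")
```

With only five seeds the test has few degrees of freedom, so reporting the per-seed differences alongside the p-value would let readers judge the effect directly.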

Circularity Check

0 steps flagged

No circularity: empirical performance on external benchmark with no derived predictions or self-referential equations


The paper presents an architectural framework (dual controllers, event trigger, hierarchical memory) and validates it via empirical metrics on the Lite-100 split of StarDojo. Success rates, token reductions, and ablation contributions are measured outcomes on an external benchmark rather than quantities obtained by fitting parameters inside the paper's own equations or by renaming inputs as outputs. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims remain falsifiable against held-out tasks and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the untested premise that an event detector can be built that triggers strategic reasoning at useful moments without introducing new failure modes. No free parameters or invented physical entities are mentioned; the main added structure is the dual-controller split itself.

axioms (1)

domain assumption: An Event Trigger based on visual change, task progress, repeated actions, and failure signals can decide when to escalate from reactive to strategic control.
This premise is invoked to justify the adaptive switching mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5867 in / 1426 out tokens · 46430 ms · 2026-05-20T10:28:24.903560+00:00 · methodology

Original abstract

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ChatImage: Navigating Long-Form LLM Answers through Interactive Images
cs.CV · 2026-07 · conditional · novelty 5.0

ChatImage renders LLM answers as images, then uses visual grounding to place clickable hotspots on rendered regions for interactive follow-up.


[1] [1]

Scaling instructable agents across many simulated worlds

MariaAbiRaad, ArunAhuja, CatarinaBarros, FredericBesse, AndrewBolt, AdrianBolton, BethanieBrownfield, Gavin Buttimore, Max Cant, Sarah Chakera, et al. Scaling instructable agents across many simulated worlds. arXiv preprint arXiv:2404.10179, 2024. 3

work page arXiv 2024

[2] [2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, et al. Do as I can, not as I say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready AI agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Mind2Web: Towards a Generalist Agent for the Web

Xiang Deng, Yu Gu, Boyuan Zheng, et al. Mind2Web: Towards a generalist agent for the web.arXiv preprint arXiv:2306.06070, 2023. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A Graph RAG approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

MineDojo: Building open-ended embodied agents with internet- scale knowledge

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building open-ended embodied agents with internet- scale knowledge. InAdvances in Neural Information Processing Systems, 2022. 3

[7]

Reasoning with language model is planning with world model

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 3

[8]

HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 32779–32798, 2025. doi: 10.18653/v1/2025.acl-long.1575. 3

[9]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022. 2, 3

[10]

HippoRAG: Neurobiologically inspired long-term memory for large language models

Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems,

[11]

Thinking, Fast and Slow

Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. 5

[12]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020. 4

[13]

TextAtari: 100K frames game playing with language agents

Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, and Bo Jin. TextAtari: 100K frames game playing with language agents. arXiv preprint arXiv:2506.04098, 2025. 3

[14]

Optimus-3: Towards generalist multimodal Minecraft agents with scalable task experts

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Yaowei Wang, and Liqiang Nie. Optimus-3: Dual-router aligned mixture-of-experts agent with dual-granularity reasoning-aware policy optimization. arXiv preprint arXiv:2506.10357, 2025. 2

[15]

SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks. In Advances in Neural Information Processing Systems, 2023. 2, 3

[16]

AgentBench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representatio...

[17]

NitroGen: An open foundation model for generalist gaming agents

Loic Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, and Linxi Fan. NitroGen: An open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427, 2026. 1, 3

[18]

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, and Mike Zheng Shou. GameWorld: Towards standardized and verifiable evaluation of multimodal game agents. arXiv preprint arXiv:2604.07429, 2026. 3

[19]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023. 2, 3

[20]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In ACM Symposium on User Interface Software and Technology, 2023. 2, 3

[21]

ADaPT: As-Needed Decomposition and Planning with Language Models

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-needed decomposition and planning with language models. In Findings of ACL: NAACL 2024, pages 4226–4252. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-naacl.264. URL https://aclanthology.org/2024.findi...

[22]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. 3, 7, 8, 10

[23]

ALFWorld: Aligning text and embodied environments for interactive learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021. 3

[24]

AdaPlanner: Adaptive planning from feedback with language models

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. AdaPlanner: Adaptive planning from feedback with language models. arXiv preprint arXiv:2305.16653, 2023. 3

[25]

CRADLE: Empowering foundation agents towards general computer control

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, et al. CRADLE: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186, 2024. 1, 2, 3, 7, 8, 10

[26]

StarDojo: Benchmarking open-ended behaviors of agentic multimodal LLMs in production-living simulations with Stardew Valley

Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, and Bo An. StarDojo: Benchmarking open-ended behaviors of agentic multimodal LLMs in production-living simulations with Stardew Valley. arXiv preprint arXiv:2507.07445, 2025. 1, 3, 7, 8, 10

[27]

Lumine: An open recipe for building generalist agents in 3D open worlds

Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, and Guang Shi. Lumine: An open recipe for building generalist agents in 3D open worlds. arXiv preprint arXiv:2511.08892, 2025. 3

[28]

Experience-Driven Exploration for Efficient API-Free AI Agents

Chenwei Tang, Lin Long, Xinyu Liu, Jingyu Xing, Zizhou Wang, Joey Tianyi Zhou, Jiawei Du, Liangli Zhen, and Jiancheng Lv. SAG-Agent: Enabling long-horizon reasoning in strategy games via dynamic knowledge graphs. arXiv preprint arXiv:2510.15259, 2025. Earlier versions circulated as “Experience-Driven Exploration for Efficient API-Free AI Agents”. 4, 7

[29]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. 1, 3, 7, 8, 10

[30]

Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents

Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. In Advances in Neural Information Processing Systems, 2023. 3

[31]

JARVIS-1: Open-world multi-task agents with memory-augmented multimodal language models

Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. JARVIS-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv:2311.05997, 2023. 3

[32]

ScenDroid: A scenario-level benchmark for long-horizon, time-evolving GUI agents

Zhe Wu, Yongxin Kang, Dabin Sheng, Junliang Xing, Guokun Wu, Derek Yuen, Donglin Mo, Yuheng Jing, Kai Li, Weilin Luo, Kun Shao, and Yuanchun Shi. ScenDroid: A scenario-level benchmark for long-horizon, time-evolving GUI agents, 2026. URL https://openreview.net/forum?id=hBTsLjjw48. OpenReview LLA 2026 poster. 3

[33]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, 2023. 3

[34]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations,

[35]

AgentOdyssey: Open-ended long-horizon text game generation for test-time continual learning agents

Zheyuan Zhang, Zehao Wen, Alvin Zhang, Andrew Wang, Jianwen Xie, Daniel Khashabi, and Tianmin Shu. AgentOdyssey: Open-ended long-horizon text game generation for test-time continual learning agents, 2026. URL https://agentodyssey.github.io/. Project page. 4

[36]

V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, and Lijuan Wang. V-MAGE: A game evaluation framework for assessing vision-centric capabilities in multimodal large language models. arXiv preprint arXiv:2504.06148, 2025. 3

[37]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 62138–62160,

[38]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023. 1, 3

Appendix Contents

A Metric and Reproducibility Details in STARDOJO

A.1 Step-budget SR versus Budgeted SR ...
