arxiv: 2605.18636 · v1 · pith:GCXCCK2Unew · submitted 2026-05-18 · 💻 cs.CV

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Pith reviewed 2026-05-20 10:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords dual controller · event trigger · long-horizon game agents · strategic planning · reactive execution · hierarchical memory · cost-efficient control · StarDojo

The pith

SPIKE reuses strategic reasoning across stable game segments and triggers full planning only at event boundaries

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that long-horizon multimodal game agents can stay goal-directed while respecting tight token and latency limits by splitting control into a low-frequency strategic layer and a high-frequency reactive layer. An event monitor watches for visual shifts, progress stalls, repeated failures, or other signals and decides when to pull in the strategic layer for global replanning or recovery. This reuses one strategic proposal across many local steps instead of recomputing at every interaction. A sympathetic reader would care because constant full reasoning wastes budget on stable stretches while pure reactive control drifts and fails to recover. The design therefore reserves expensive deliberation for the moments it is actually needed.

Core claim

SPIKE is an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank from structured evidence in the State Action Knowledge Graph. This design reuses strategic proposals over multiple reactive steps and supports local override when plans become stale.
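The control split described above can be pictured as a loop that caches one strategic proposal and reuses it until a trigger fires. The following is a minimal sketch under stated assumptions, not the paper's implementation: `run_episode`, `StepLog`, and the three callables are hypothetical names, and the controllers are stand-in lambdas rather than model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepLog:
    step: int
    action: str
    escalated: bool


def run_episode(
    observations: List[str],
    strategic_plan: Callable[[str], str],
    reactive_act: Callable[[str, str], str],
    should_escalate: Callable[[int, str], bool],
) -> List[StepLog]:
    """Run one episode, calling the strategic layer only when asked to."""
    plan = ""
    log: List[StepLog] = []
    for step, obs in enumerate(observations):
        # Escalate on the first step (no plan yet) or when the trigger fires.
        escalated = step == 0 or should_escalate(step, obs)
        if escalated:
            plan = strategic_plan(obs)  # expensive, low-frequency call
        # Cheap, high-frequency call that reuses the cached plan.
        action = reactive_act(obs, plan)
        log.append(StepLog(step, action, escalated))
    return log


# Toy stand-ins (assumptions): an "event" is any observation containing "blocked".
obs_stream = ["field", "field", "blocked path", "field", "field"]
log = run_episode(
    obs_stream,
    strategic_plan=lambda obs: f"plan@{obs}",
    reactive_act=lambda obs, plan: f"follow {plan}",
    should_escalate=lambda step, obs: "blocked" in obs,
)
strategic_calls = sum(entry.escalated for entry in log)
print(strategic_calls, "strategic calls over", len(log), "steps")
```

In this toy run the strategic layer is called twice in five steps, and steps 3 and 4 keep following the plan produced at the event boundary.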

What carries the argument

Event Trigger that monitors visual change, task progress, repeated actions, and failure signals to decide when to escalate from reactive execution to strategic reasoning
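One way to read this mechanism is as a rule over the four monitored signals. The sketch below is an assumption-laden illustration: the source does not give the trigger's decision rule, so the field names, the thresholds in `TriggerConfig`, and the simple "any signal crosses its threshold" logic are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    visual_change: float       # frame-difference score in [0, 1] (assumed scale)
    steps_since_progress: int  # steps elapsed without task progress
    repeated_actions: int      # length of the current run of identical actions
    failure: bool              # explicit failure signal


@dataclass
class TriggerConfig:
    # Illustrative thresholds only; none of these values come from the paper.
    visual_threshold: float = 0.5
    stall_limit: int = 8
    repeat_limit: int = 3


def escalation_reasons(sig: Signals, cfg: TriggerConfig) -> List[str]:
    """Return the reasons to escalate; an empty list means stay reactive."""
    reasons: List[str] = []
    if sig.failure:
        reasons.append("failure")
    if sig.visual_change >= cfg.visual_threshold:
        reasons.append("visual_change")
    if sig.steps_since_progress >= cfg.stall_limit:
        reasons.append("progress_stall")
    if sig.repeated_actions >= cfg.repeat_limit:
        reasons.append("repeated_actions")
    return reasons


cfg = TriggerConfig()
stable = escalation_reasons(Signals(0.1, 2, 1, False), cfg)
scene_cut = escalation_reasons(Signals(0.9, 0, 1, False), cfg)
stuck = escalation_reasons(Signals(0.0, 10, 4, True), cfg)
print(stable, scene_cut, stuck)
```

Returning the reasons rather than a bare boolean makes it possible to log why each escalation happened, which is the kind of per-task trigger statistic the referee report asks for.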

If this is right

  • Strategic reasoning is reused over multiple reactive steps rather than recomputed at every interaction.
  • Local override remains possible when plans become stale or conditions shift.
  • Expensive reasoning is reserved for moments where extra deliberation adds value.
  • Token consumption drops by 54.9% and latency by 40.8%, while success rate on the Lite-100 split of StarDojo rises by 5.0 percentage points over the strongest baseline.
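The savings in these bullets depend on how often the strategic layer is actually invoked. A rough amortization sketch, with token costs that are purely hypothetical placeholders and a baseline that is assumed to run full reasoning at every step, shows how the saving shrinks and then turns negative as escalations approach one per step:

```python
def selective_tokens(steps: int, strategic_cost: int,
                     reactive_cost: int, escalations: int) -> int:
    """Tokens used when every step is reactive and some steps also escalate."""
    return steps * reactive_cost + escalations * strategic_cost


def saving_vs_always_strategic(steps: int, strategic_cost: int,
                               reactive_cost: int, escalations: int) -> float:
    """Fractional saving relative to full reasoning at every step."""
    always = steps * strategic_cost
    selective = selective_tokens(steps, strategic_cost, reactive_cost, escalations)
    return 1.0 - selective / always


# Hypothetical costs: 2000 tokens per strategic call, 300 per reactive call.
rare = saving_vs_always_strategic(100, 2000, 300, escalations=10)
constant = saving_vs_always_strategic(100, 2000, 300, escalations=100)
print(f"10 escalations: {rare:.2f}, 100 escalations: {constant:.2f}")
```

With these made-up costs, ten escalations in a hundred steps save 75% of the tokens, while a trigger that fires on every step costs 15% more than the always-strategic baseline. The numbers are illustrative and are not the paper's measurements.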

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same event-triggered split between planning and execution could be tested in other resource-limited sequential decision domains such as robot manipulation under vision noise.
  • If the trigger thresholds prove stable across games, the architecture suggests a general template for keeping high-level models in reserve rather than in the loop at every timestep.
  • Separating short-term action memory from structured knowledge graphs may offer a reusable pattern for context management when different control layers need different retrieval styles.
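The memory separation in the last bullet can be illustrated with two small data structures that support different retrieval styles. This is a sketch under assumptions: the class names echo the paper's terms, but the bounded-buffer design, the exact-match recall, and the breadth-first path query are illustrative choices, not the authors' SA-MB or SA-KG.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional, Set, Tuple


class StateActionMemoryBank:
    """Bounded buffer of recent (state, action, success) records."""

    def __init__(self, capacity: int = 4) -> None:
        self._records: Deque[Tuple[str, str, bool]] = deque(maxlen=capacity)

    def add(self, state: str, action: str, success: bool) -> None:
        self._records.append((state, action, success))

    def recall(self, state: str) -> Optional[str]:
        """Most recent successful action recorded for an identical state."""
        for s, a, ok in reversed(self._records):
            if s == state and ok:
                return a
        return None


class StateActionKnowledgeGraph:
    """Directed graph of state --action--> next_state edges."""

    def __init__(self) -> None:
        self._edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add_edge(self, state: str, action: str, next_state: str) -> None:
        self._edges[state].add((action, next_state))

    def path(self, start: str, goal: str) -> Optional[List[str]]:
        """Breadth-first search for an action sequence from start to goal."""
        frontier: Deque[Tuple[str, List[str]]] = deque([(start, [])])
        seen = {start}
        while frontier:
            state, actions = frontier.popleft()
            if state == goal:
                return actions
            for action, nxt in sorted(self._edges.get(state, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, actions + [action]))
        return None


bank = StateActionMemoryBank(capacity=2)
bank.add("at_door", "open", True)
bank.add("at_door", "push", False)
recent = bank.recall("at_door")

kg = StateActionKnowledgeGraph()
kg.add_edge("farm", "walk_east", "town")
kg.add_edge("town", "enter_shop", "shop")
route = kg.path("farm", "shop")
print(recent, route)
```

The buffer answers "what worked here a moment ago" and forgets old records once it is full, while the graph answers "how do I get from here to there" from accumulated structure. That difference in query shape is what the bullet suggests may transfer to other layered agents.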

Load-bearing premise

The event trigger can reliably detect the right moments for strategic reasoning without excessive false positives or missed escalations that let the reactive controller drift.

What would settle it

Ablating the event trigger on the Lite-100 StarDojo split and measuring whether success-rate and budgeted-success gains over the strongest baselines disappear or reverse.

Original abstract

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
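The abstract reports each gain both in percentage points and in relative terms, which together imply the underlying rates. The arithmetic below is a derived sanity check, not a figure stated in the abstract; the results are approximate because the reported numbers are rounded, and `implied_rates` is a hypothetical helper name.

```python
from typing import Tuple


def implied_rates(gain_pp: float, relative_gain_pct: float) -> Tuple[float, float]:
    """Back out (baseline, improved) rates in percent.

    Uses relative_gain = gain_pp / baseline, so baseline = gain_pp / relative_gain.
    """
    baseline = gain_pp / (relative_gain_pct / 100.0)
    return baseline, baseline + gain_pp


sr_base, sr_spike = implied_rates(5.0, 38.5)      # success rate
bsr_base, bsr_spike = implied_rates(9.3, 75.6)    # Budgeted SR

print(f"SR: baseline ~{sr_base:.1f}%, SPIKE ~{sr_spike:.1f}%")
print(f"Budgeted SR: baseline ~{bsr_base:.1f}%, SPIKE ~{bsr_spike:.1f}%")
```

Under this reading the strongest baselines sit at roughly 13.0% SR and 12.3% Budgeted SR, with SPIKE at roughly 18.0% and 21.6%. Absolute success therefore remains low on this split, which is worth keeping in mind when weighing the large relative improvements.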

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SPIKE, an adaptive dual-controller framework for cost-efficient long-horizon multimodal game agents. A low-frequency Strategic Controller handles global planning, failure analysis, and recovery; a Reactive Controller manages fast local execution under token budgets; an Event Trigger decides escalation based on visual change, task progress, repeated actions, and failure signals; and Hierarchical Memory (SA-MB for short-term reuse plus SA-KG for structured evidence) supports context retrieval. On the Lite-100 split of StarDojo the method reports +5.0 pp success rate (38.5% relative) and +9.3 pp Budgeted SR (75.6% relative) versus strongest baselines, together with 54.9% token and 40.8% latency reductions. Ablations are cited as evidence that event triggering, reactive override, and heterogeneous memory each contribute to performance.

Significance. If the empirical gains hold under rigorous controls, the selective-reasoning design could be moderately significant for resource-constrained long-horizon agents, showing that strategic deliberation can be reused across stable segments rather than invoked every step. The hierarchical memory separation is a practical contribution. The work does not offer parameter-free derivations, machine-checked proofs, or falsifiable predictions beyond the reported benchmark numbers.

major comments (2)
  1. Abstract and experimental results: the claimed 5.0 pp SR gain and 54.9% token reduction are presented without any report of the number of runs, statistical significance tests, variance, exact baseline implementations, or data-split procedures. These omissions are load-bearing for assessing whether the gains are robust or could arise from implementation details or split-specific tuning.
  2. Event Trigger (method description): the efficiency claim rests on the trigger correctly escalating only at true boundaries while avoiding drift or wasted calls. Although ablations are said to show its contribution to success/recovery, no precision, recall, false-positive/negative rates, or per-task trigger statistics are supplied. Without these metrics it is impossible to attribute the token savings specifically to selective reasoning rather than to the reactive override or heterogeneous memory alone.
minor comments (1)
  1. Abstract: the Lite-100 split and StarDojo benchmark should be briefly characterized (task count, horizon length, observation modality) so readers can judge applicability without immediately consulting the full experimental section.
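The run-level statistics requested in major comment 1 are cheap to produce once per-seed results exist. One common choice is a paired t-test across seeds; the sketch below computes the statistic with the standard library only. The per-seed success rates are made-up placeholders for illustration, not results from the paper, and `paired_t` is a hypothetical helper.

```python
import math
import statistics
from typing import List, Tuple


def paired_t(a: List[float], b: List[float]) -> Tuple[float, float, float]:
    """Return (mean difference, std of differences, t statistic) for paired runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat


# Placeholder success rates (%) for 5 seeds; NOT taken from the paper.
method = [18.0, 17.0, 19.0, 18.0, 18.0]
baseline = [13.0, 13.0, 14.0, 12.0, 13.0]

mean_diff, sd_diff, t_stat = paired_t(method, baseline)
# Two-sided critical value for alpha = 0.05 with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
print(f"mean diff {mean_diff:.1f} pp, sd {sd_diff:.2f}, t = {t_stat:.2f}, "
      f"significant: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed is only appropriate when the method and the baseline share seeds and task instances; otherwise an unpaired test would be the safer default.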

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments on our work. We address each of the major comments below and have made revisions to the manuscript to improve clarity and provide additional details where requested.

Point-by-point responses
  1. Referee: Abstract and experimental results: the claimed 5.0 pp SR gain and 54.9% token reduction are presented without any report of the number of runs, statistical significance tests, variance, exact baseline implementations, or data-split procedures. These omissions are load-bearing for assessing whether the gains are robust or could arise from implementation details or split-specific tuning.

    Authors: We agree with the referee that additional experimental details are necessary to substantiate the reported improvements. In the revised manuscript, we have added information specifying that results are averaged over 5 independent runs with different random seeds. We report the standard deviations alongside the means for success rate and token consumption. We have performed and included results from paired t-tests, confirming statistical significance (p < 0.05) for the key gains. Furthermore, we have expanded the experimental setup to describe the exact implementations of the baselines and the procedure for the Lite-100 data split. revision: yes

  2. Referee: Event Trigger (method description): the efficiency claim rests on the trigger correctly escalating only at true boundaries while avoiding drift or wasted calls. Although ablations are said to show its contribution to success/recovery, no precision, recall, false-positive/negative rates, or per-task trigger statistics are supplied. Without these metrics it is impossible to attribute the token savings specifically to selective reasoning rather than to the reactive override or heterogeneous memory alone.

    Authors: We appreciate this observation. Our ablations already isolate the effect of the Event Trigger by comparing the full system against a variant without it, showing its contribution to both success and token efficiency. To further address the concern, we have included in the revised version per-task trigger statistics, such as the average number of escalations per episode and examples of trigger activations. However, since the benchmark does not provide ground-truth labels for event boundaries, we are unable to compute precision and recall directly; instead, we rely on the ablation results and qualitative analysis to support the attribution of savings to selective reasoning. revision: partial
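The trigger statistics discussed in this exchange are straightforward to compute from escalation logs, and precision and recall follow directly if event-boundary labels were ever annotated. The sketch below assumes hypothetical log formats (one boolean per step, and sets of step indices for fired and labeled events); the rebuttal notes that the benchmark provides no such labels, so the second function applies only to a hand-annotated subset.

```python
from typing import List, Set, Tuple


def escalation_stats(episodes: List[List[bool]]) -> Tuple[float, float]:
    """Return (mean escalations per episode, fraction of steps that escalate).

    episodes[i][t] is True when the trigger fired at step t of episode i.
    """
    total_steps = sum(len(ep) for ep in episodes)
    if not episodes or total_steps == 0:
        raise ValueError("need at least one non-empty episode")
    total_fires = sum(sum(ep) for ep in episodes)
    return total_fires / len(episodes), total_fires / total_steps


def precision_recall(fired: Set[int], labeled: Set[int]) -> Tuple[float, float]:
    """Exact-match precision and recall of fired steps against labeled events."""
    true_positives = len(fired & labeled)
    precision = true_positives / len(fired) if fired else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall


# Toy logs (assumptions): two short episodes and one annotated episode.
logs = [[True, False, False, True],
        [True, False, False, False, False, False]]
per_episode, per_step = escalation_stats(logs)
precision, recall = precision_recall(fired={3, 10, 17, 25}, labeled={3, 17, 30})
print(per_episode, per_step, precision, round(recall, 3))
```

Exact step matching is a strict criterion. A tolerance window around each labeled boundary would be a natural relaxation, at the cost of one more free parameter to report.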

Circularity Check

0 steps flagged

No circularity: empirical performance on external benchmark with no derived predictions or self-referential equations

Full rationale

The paper presents an architectural framework (dual controllers, event trigger, hierarchical memory) and validates it via empirical metrics on the Lite-100 split of StarDojo. Success rates, token reductions, and ablation contributions are measured outcomes on an external benchmark rather than quantities obtained by fitting parameters inside the paper's own equations or by renaming inputs as outputs. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims remain falsifiable against held-out tasks and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the untested premise that an event detector can be built that triggers strategic reasoning at useful moments without introducing new failure modes. No free parameters or invented physical entities are mentioned; the main added structure is the dual-controller split itself.

axioms (1)
  • Domain assumption: an Event Trigger based on visual change, task progress, repeated actions, and failure signals can decide when to escalate from reactive to strategic control.
    This premise is invoked to justify the adaptive switching mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5867 in / 1426 out tokens · 46430 ms · 2026-05-20T10:28:24.903560+00:00 · methodology

