pith. machine review for the scientific record.

arxiv: 2605.12755 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

State-Centric Decision Process

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3

classification 💻 cs.AI
keywords State-Centric Decision Process · language agents · decision processes · planning benchmarks · reasoning tasks · certified trajectories · training-free methods

The pith

The State-Centric Decision Process lets agents build their own certified state spaces from raw language observations using natural-language predicates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDP as a way to supply the missing structure that standard decision-process analysis needs when environments emit only text. The agent repeatedly commits to a predicate that describes the next desired world state, acts to satisfy it, and checks whether the received observation matches the predicate. Successful predicates become certified states, and the resulting sequence supplies a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. On five benchmarks the method records the strongest training-free scores, with the margin growing as horizons lengthen, and the certified trajectories additionally allow per-predicate credit assignment and failure localization.

Core claim

SDP is a runtime framework that constructs the four objects language environments lack—an explicit state space, an observation-to-state mapping, certified transitions, and a termination criterion—by requiring the agent to commit to a natural-language predicate at every step, act to make it true, and certify the observation against it. The resulting trajectories support standard analyses plus diagnostics unavailable to purely reactive agents.
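The commit–act–verify loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the names PROPOSE, REALIZE, VALIDATE, REPLAN, and the failure budget b come from the paper's Figure 1, but the function signatures and data structures here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CertifiedStep:
    """One certified transition: the committed predicate, the action
    taken to satisfy it, and the observation that passed VALIDATE."""
    predicate: str
    action: str
    observation: str

def run_sdp(propose, realize, validate, goal, budget=3, max_steps=20):
    """Sketch of the SDP loop: commit to a predicate, act to make it
    true, certify the observation against it. Hypothetical interface."""
    trajectory = []                        # the certified trajectory
    for _ in range(max_steps):
        predicate = propose(trajectory, goal)    # commit to next state
        failures = 0
        certified = False
        while failures < budget:                 # REPLAN after b failures
            action, observation = realize(predicate)
            if validate(predicate, observation): # sole interface to obs.
                trajectory.append(CertifiedStep(predicate, action, observation))
                certified = True
                break
            failures += 1
        if certified and predicate == goal:      # termination criterion
            break
    return trajectory
```

A toy environment (a counter the agent increments toward a target) is enough to exercise the loop; the certified trajectory then supplies the task-induced state space the paper describes.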

What carries the argument

The predicate commitment-and-verification loop, which turns each raw observation into a certified state only after the agent has declared and checked a natural-language description of the target condition.

If this is right

  • SDP records the highest training-free scores on all five evaluated benchmarks.
  • The performance gap over reactive baselines increases with longer task horizons.
  • Certified trajectories enable per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.
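The diagnostics in the list above all read directly off the certified trajectory. A minimal sketch, assuming a hypothetical representation of per-predicate VALIDATE outcomes as a list of booleans (not the paper's actual data format):

```python
def first_failure(flags):
    """Failure localization: index of the first predicate that failed
    VALIDATE, or None if the whole chain was certified."""
    for i, ok in enumerate(flags):
        if not ok:
            return i
    return None

def partial_progress(flags):
    """Partial-progress measurement: fraction of the predicate chain
    certified before the first failure."""
    if not flags:
        return 0.0
    fail = first_failure(flags)
    certified = len(flags) if fail is None else fail
    return certified / len(flags)
```

Per-predicate credit assignment follows the same pattern: because each step carries its own pass/fail signal, blame attaches to a specific predicate rather than to the run as a whole.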

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The certified states could be fed directly into existing symbolic planners to replace exhaustive search with guided predicate sequences.
  • Verification accuracy might be raised by cross-checking a predicate against multiple observations or external oracles.
  • The same predicate loop could be applied to embodied agents whose sensors return text descriptions of physical scenes.

Load-bearing premise

An agent can reliably choose a natural-language predicate describing the desired next state and correctly verify whether the subsequent observation satisfies that predicate without external supervision.

What would settle it

A long-horizon benchmark run in which SDP produces verification errors on more than half the predicates and ends with lower success rates than a simple reactive baseline would falsify the claim that the loop reliably supplies usable certified states.

Figures

Figures reproduced from arXiv: 2605.12755 by Mahdi Imani, Mohsen Imani, Ryozo Masukawa, Sanggeon Yun, Sungheon Jeong.

Figure 1
Figure 1. The SDP execution structure. PROPOSE builds the predicate chain from s0 to g. REALIZE acts toward the next predicate. VALIDATE checks the observation and is the sole interface for O. When failures at a target exceed b, REPLAN discards and builds a new state. PROPOSE commits to the outer choice before execution begins, REALIZE solves the inner problem at each step, and VALIDATE supplies the per-step signal…
Figure 2
Figure 2. Example SDP run, with a cascade (k ≥ 2) and a replan opportunity.
Figure 3
Figure 3. Anatomy of SDP trajectories. (a) Cascade depth. (b) Score rate vs. replan count. (c) Certified progress in failed runs. (d) VALIDATE calibration.
original abstract

Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires. No explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the State-Centric Decision Process (SDP), a runtime framework for language environments (e.g., browsers, code terminals) that lack explicit MDP structure. The agent commits to natural-language predicates describing desired world states, takes actions to satisfy them, and verifies observations against the predicates; passing predicates become certified states. This yields a task-induced state space, observation-to-state mapping, certified transitions, and termination criterion. SDP is evaluated on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop QA, claiming the best training-free results with the advantage widening as the horizon grows. The certified trajectories enable analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.

Significance. If the self-verification step is reliable, SDP offers a substantive contribution by retrofitting formal decision-process structure onto unstructured language agents without training. This could enable interpretable long-horizon reasoning and diagnostic tools in practical domains. The reported widening performance gap with horizon length, if substantiated with quantitative data, would strengthen the case for SDP over purely reactive baselines.

major comments (2)
  1. [SDP Construction] Predicate verification: The framework certifies states and transitions solely via the same language model verifying its own generated predicates against observations. No external oracle, ground-truth checker, or measured verification accuracy (e.g., false-positive rate on the NLI task) is described. This assumption is load-bearing for the 'certified' label, the constructed state space, termination criterion, and all downstream empirical claims and analyses.
  2. [Experimental Evaluation] The claim of achieving the best training-free results on all five benchmarks (with widening advantage as horizon grows) is presented without quantitative tables, specific metrics, error bars, baseline details, or any reporting on predicate verification accuracy in the evaluation. This prevents assessment of whether the superiority holds after accounting for potential verification errors.
minor comments (1)
  1. [Introduction] The paper would benefit from an early concrete example (e.g., one full predicate-verify cycle on a benchmark) to clarify the observation-to-state mapping and termination criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and empirical support.

point-by-point responses
  1. Referee: [SDP Construction] Predicate verification: The framework certifies states and transitions solely via the same language model verifying its own generated predicates against observations. No external oracle, ground-truth checker, or measured verification accuracy (e.g., false-positive rate on the NLI task) is described. This assumption is load-bearing for the 'certified' label, the constructed state space, termination criterion, and all downstream empirical claims and analyses.

    Authors: We agree that SDP relies on the language model's self-verification for certification, as no external oracle is available in the unstructured language environments we target. This is a deliberate design choice to enable structured decision processes without additional infrastructure. In the revision we will add a dedicated subsection (new Section 4.3) that empirically measures verification reliability: we will sample 200 predicate-observation pairs across the benchmarks, obtain human annotations for correctness, and report precision, recall, and false-positive rates. This directly quantifies the load-bearing assumption and supports the certified state space claims. revision: yes

  2. Referee: [Experimental Evaluation] The claim of achieving the best training-free results on all five benchmarks (with widening advantage as horizon grows) is presented without quantitative tables, specific metrics, error bars, baseline details, or any reporting on predicate verification accuracy in the evaluation. This prevents assessment of whether the superiority holds after accounting for potential verification errors.

    Authors: The full manuscript contains quantitative tables in Section 5 comparing SDP against training-free baselines on all five benchmarks, along with a plot (Figure 3) demonstrating the widening performance gap as horizon length increases. We will revise the experimental section to include error bars from five independent runs per benchmark, expanded baseline implementation details, and a new paragraph linking the verification accuracy results (from the added Section 4.3) to the main performance claims. This will allow readers to evaluate robustness after accounting for verification errors. revision: partial
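The reliability measurement proposed in response 1 is mechanically simple. A sketch, assuming the VALIDATE verdicts and the human annotations are collected as parallel boolean lists (the function name and representation are illustrative, not from the paper):

```python
def verification_metrics(predicted, annotated):
    """Precision, recall, and false-positive rate of VALIDATE verdicts
    against human annotations. `predicted` holds the model's pass/fail
    decisions on predicate-observation pairs; `annotated` the ground truth."""
    tp = sum(p and a for p, a in zip(predicted, annotated))
    fp = sum(p and not a for p, a in zip(predicted, annotated))
    fn = sum((not p) and a for p, a in zip(predicted, annotated))
    tn = sum((not p) and (not a) for p, a in zip(predicted, annotated))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0   # rate of spurious certifications
    return precision, recall, fpr
```

The false-positive rate is the quantity that bears most directly on the 'certified' label: a false positive is a state certified that the annotators judge unsatisfied.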

Circularity Check

0 steps flagged

No circularity: SDP is a runtime construction independent of fitted parameters or self-referential derivations

full rationale

The paper introduces SDP as a framework where an agent constructs state spaces, mappings, transitions, and termination criteria on the fly via natural-language predicates that are committed to and then checked against observations. No equations appear in the provided text that reduce any claimed result to a fitted input or self-definition. No self-citations are invoked to justify core premises, and the central claims rest on the runtime behavior rather than any parameter estimation or renaming of prior results. The derivation chain is self-contained against external benchmarks and does not reduce any prediction to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that natural-language predicates can serve as verifiable state descriptors in text environments. No free parameters or invented physical entities are introduced; the main addition is the SDP process itself.

axioms (1)
  • domain assumption: Natural-language predicates can be reliably committed to and verified against raw text observations by the agent
    The entire construction of certified states depends on accurate predicate evaluation without external ground truth.
invented entities (1)
  • State-Centric Decision Process (SDP) · no independent evidence
    purpose: To dynamically construct state space, transitions, and termination from text observations
    New runtime framework introduced to supply the four missing objects that language environments lack.

pith-pipeline@v0.9.0 · 5509 in / 1373 out tokens · 40942 ms · 2026-05-14T19:33:46.948475+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 7 internal anchors

  1. [1]

    State abstractions for lifelong reinforcement learning

    David Abel, Dilip Arumugam, Lucas Lehnert, and Michael Littman. State abstractions for lifelong reinforcement learning. In International Conference on Machine Learning, pages 10–19. PMLR, 2018

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  4. [4]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017

  5. [5]

    Agent context protocols enhance collective inference

    Devansh Bhardwaj, Arjun Beniwal, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R Narasimhan, Ameet Deshpande, and Vishvak Murahari. Agent context protocols enhance collective inference. arXiv preprint arXiv:2505.14569, 2025

  6. [6]

    Reinforcement learning for mapping instructions to actions

    Satchuthananthavale RK Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. InProceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90, 2009

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  8. [8]

    Enhancing LLM-based agents via global planning and hierarchical execution

    Junjie Chen, Haitao Li, Jingli Yang, Yiqun Liu, and Qingyao Ai. Enhancing llm-based agents via global planning and hierarchical execution.arXiv preprint arXiv:2504.16563, 2025

  9. [9]

    Atlas: Constraints-aware multi-agent collaboration for real-world travel planning

    Jihye Choi, Jinsung Yoon, Jiefeng Chen, Somesh Jha, and Tomas Pfister. Atlas: Constraints-aware multi-agent collaboration for real-world travel planning. arXiv preprint arXiv:2509.25586, 2025

  10. [10]

    ReasonPlanner: Enhancing autonomous planning in dynamic environments with temporal knowledge graphs and LLMs

    Minh Pham Dinh, Munira Syed, G Yankoski Michael, and Trenton W Ford. Reasonplanner: Enhancing autonomous planning in dynamic environments with temporal knowledge graphs and llms. arXiv preprint arXiv:2410.09252, 2024

  11. [11]

    Plan-and-Act: Improving planning of agents for long-horizon tasks

    Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572, 2025

  12. [12]

    Magentic-One: A generalist multi-agent system for solving complex tasks

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-One: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468, 2024

  13. [13]

    Equivalence notions and model minimization in Markov decision processes

    Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in markov decision processes.Artificial intelligence, 147(1-2):163–223, 2003

  14. [14]

    Mirror: Multi-agent intra- and inter-reflection for optimized reasoning in tool learning

    Zikang Guo, Benfeng Xu, Xiaorui Wang, and Zhendong Mao. Mirror: Multi-agent intra- and inter-reflection for optimized reasoning in tool learning. arXiv preprint arXiv:2505.20670, 2025

  15. [15]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023

  16. [16]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

  17. [17]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

  18. [18]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  19. [19]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

  20. [20]

    Shifting from ranking to set selection for retrieval augmented generation

    Dahyun Lee, Yongrae Jo, Haeju Park, and Moontae Lee. Shifting from ranking to set selection for retrieval augmented generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17606–17619, 2025

  21. [21]

    Towards a unified theory of state abstraction for MDPs

    Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps.AI&M, 1(2):3, 2006

  22. [22]

    Beyond entangled planning: Task-decoupled planning for long-horizon agents

    Yunfan Li, Bingbing Xu, Xueyun Tian, Xiucheng Xu, and Huawei Shen. Beyond entangled planning: Task-decoupled planning for long-horizon agents.arXiv preprint arXiv:2601.07577, 2026

  23. [23]

    SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks

    Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. Advances in Neural Information Processing Systems, 36:23813–23825, 2023

  24. [24]

    Positive experience reflection for agents in interactive text environments

    Philip Lippmann, Matthijs TJ Spaan, and Jie Yang. Positive experience reflection for agents in interactive text environments. InProceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 131–142, 2025

  25. [25]

    Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency

    Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency. arXiv preprint arXiv:2309.17382, 2023

  26. [26]

    Clin: A continually learning language agent for rapid task adaptation and generalization

    Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. Clin: A continually learning language agent for rapid task adaptation and generalization.arXiv preprint arXiv:2310.10134, 2023

  27. [27]

    A survey of POMDP solution techniques

    Kevin P Murphy et al. A survey of pomdp solution techniques.environment, 2(10):1–12, 2000

  28. [28]

    Prism: Agentic retrieval with LLMs for multi-hop question answering

    Md Mahadi Hasan Nahid and Davood Rafiei. Prism: Agentic retrieval with llms for multi-hop question answering.arXiv preprint arXiv:2510.14278, 2025

  29. [29]

    RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze! arXiv preprint arXiv:2312.02724, 2023

  30. [30]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

  31. [31]

    Markov decision processes: Discrete stochastic dynamic programming

    Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  32. [32]

    Infogent: An agent-based framework for web information aggregation

    Revanth Gangi Reddy, Sagnik Mukherjee, Jeonghwan Kim, Zhenhailong Wang, Dilek Hakkani-Tur, and Heng Ji. Infogent: An agent-based framework for web information aggregation. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 5745–5758, 2025

  33. [33]

    Universal value function approximators

    Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320. PMLR, 2015

  34. [34]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  35. [35]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  36. [36]

    Mastering the game of Go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge.nature, 550(7676):354–359, 2017

  37. [37]

    Cognitive architectures for language agents

    Theodore Sumers, Shunyu Yao, Karthik R Narasimhan, and Thomas L Griffiths. Cognitive architectures for language agents.Transactions on Machine Learning Research, 2023

  38. [38]

    AdaPlanner: Adaptive planning from feedback with language models

    Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from feedback with language models. Advances in Neural Information Processing Systems, 36:58202–58245, 2023

  39. [39]

    Is chatgpt good at search? investigating large language models as re-ranking agents

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is chatgpt good at search? investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14918–14937, 2023

  40. [40]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998

  41. [41]

    MuSiQue: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  42. [42]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 10014– 10037, 2023

  43. [43]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  44. [44]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  45. [45]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634, 2023

  46. [46]

    ScienceWorld: Is your agent smarter than a 5th grader?

    Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

  47. [47]

    Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents

    Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023

  48. [48]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  49. [49]

    Partially observable Markov decision processes for spoken dialog systems

    Jason D Williams and Steve Young. Partially observable markov decision processes for spoken dialog systems.Computer Speech & Language, 21(2):393–422, 2007

  50. [50]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992

  51. [51]

    The rise and potential of large language model based agents: A survey

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025

  52. [52]

    TravelPlanner: A benchmark for real-world planning with language agents

    Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents.arXiv preprint arXiv:2402.01622, 2024

  53. [53]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  54. [54]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

  55. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  56. [56]

    AssistantBench: Can web agents solve realistic and time-consuming tasks?

    Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8938–8968, 2024

  57. [57]

    POMDP-based statistical spoken dialog systems: A review

    Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013

  58. [58]

    Evoagent: Towards automatic multi-agent generation via evolutionary algorithms

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, and Deqing Yang. Evoagent: Towards automatic multi-agent generation via evolutionary algorithms. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6192–6217, 2025

  59. [59]

    Planning with multi-constraints via collaborative language agents

    Cong Zhang, Xin Deik Goh, Dexun Li, Hao Zhang, and Yong Liu. Planning with multi-constraints via collaborative language agents. InProceedings of the 31st International Conference on Computational Linguistics, pages 10054–10082, 2025

  60. [60]

    Expel: Llm agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

  61. [61]

    GPT-4V(ision) is a generalist web agent, if grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024

  62. [62]

    Language agent tree search unifies reasoning acting and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023

  63. [63]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023