pith. machine review for the scientific record.

arxiv: 2605.12755 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

State-Centric Decision Process

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3

classification 💻 cs.AI
keywords State-Centric Decision Process · language agents · decision processes · planning benchmarks · reasoning tasks · certified trajectories · training-free methods

The pith

The State-Centric Decision Process lets agents build their own certified state spaces from raw language observations using natural-language predicates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDP as a way to supply the missing structure that standard decision-process analysis needs when environments emit only text. The agent repeatedly commits to a predicate that describes the next desired world state, acts to satisfy it, and checks whether the received observation matches the predicate. Successful predicates become certified states, and the resulting sequence supplies a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. On five benchmarks the method records the strongest training-free scores, with the margin growing as horizons lengthen, and the certified trajectories additionally allow per-predicate credit assignment and failure localization.

Core claim

SDP is a runtime framework that constructs the four objects language environments lack—an explicit state space, an observation-to-state mapping, certified transitions, and a termination criterion—by requiring the agent to commit to a natural-language predicate at every step, act to make it true, and certify the observation against it. The resulting trajectories support standard analyses plus diagnostics unavailable to purely reactive agents.
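The commit–act–verify loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the names PROPOSE, REALIZE, VALIDATE, REPLAN, and the failure budget b come from the paper's Figure 1, but the function signatures and data structures here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CertifiedStep:
    """One certified transition: the committed predicate, the action
    taken to satisfy it, and the observation that passed VALIDATE."""
    predicate: str
    action: str
    observation: str

def run_sdp(propose, realize, validate, goal, budget=3, max_steps=20):
    """Sketch of the SDP loop: commit to a predicate, act to make it
    true, certify the observation against it. Hypothetical interface."""
    trajectory = []                        # the certified trajectory
    for _ in range(max_steps):
        predicate = propose(trajectory, goal)    # commit to next state
        failures = 0
        certified = False
        while failures < budget:                 # REPLAN after b failures
            action, observation = realize(predicate)
            if validate(predicate, observation): # sole interface to obs.
                trajectory.append(CertifiedStep(predicate, action, observation))
                certified = True
                break
            failures += 1
        if certified and predicate == goal:      # termination criterion
            break
    return trajectory
```

A toy environment (a counter the agent increments toward a target) is enough to exercise the loop; the certified trajectory then supplies the task-induced state space the paper describes.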

What carries the argument

The predicate commitment-and-verification loop, which turns each raw observation into a certified state only after the agent has declared and checked a natural-language description of the target condition.

If this is right

  • SDP records the highest training-free scores on all five evaluated benchmarks.
  • The performance gap over reactive baselines increases with longer task horizons.
  • Certified trajectories enable per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.
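The diagnostics in the list above all read directly off the certified trajectory. A minimal sketch, assuming a hypothetical representation of per-predicate VALIDATE outcomes as a list of booleans (not the paper's actual data format):

```python
def first_failure(flags):
    """Failure localization: index of the first predicate that failed
    VALIDATE, or None if the whole chain was certified."""
    for i, ok in enumerate(flags):
        if not ok:
            return i
    return None

def partial_progress(flags):
    """Partial-progress measurement: fraction of the predicate chain
    certified before the first failure."""
    if not flags:
        return 0.0
    fail = first_failure(flags)
    certified = len(flags) if fail is None else fail
    return certified / len(flags)
```

Per-predicate credit assignment follows the same pattern: because each step carries its own pass/fail signal, blame attaches to a specific predicate rather than to the run as a whole.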

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The certified states could be fed directly into existing symbolic planners to replace exhaustive search with guided predicate sequences.
  • Verification accuracy might be raised by cross-checking a predicate against multiple observations or external oracles.
  • The same predicate loop could be applied to embodied agents whose sensors return text descriptions of physical scenes.

Load-bearing premise

An agent can reliably choose a natural-language predicate describing the desired next state and correctly verify whether the subsequent observation satisfies that predicate without external supervision.

What would settle it

A long-horizon benchmark run in which SDP produces verification errors on more than half the predicates and ends with lower success rates than a simple reactive baseline would falsify the claim that the loop reliably supplies usable certified states.

Figures

Figures reproduced from arXiv: 2605.12755 by Mahdi Imani, Mohsen Imani, Ryozo Masukawa, Sanggeon Yun, Sungheon Jeong.

Figure 1
Figure 1. The SDP execution structure. PROPOSE builds the predicate chain from s0 to g. REALIZE acts toward the next predicate. VALIDATE checks the observation and is the sole interface for O. When failures at a target exceed b, REPLAN discards and builds a new state. PROPOSE commits to the outer choice before execution begins, REALIZE solves the inner problem at each step, and VALIDATE supplies the per-step signal…
Figure 2
Figure 2. Example SDP run, with a cascade (k ≥ 2) and a replan opportunity.
Figure 3
Figure 3. Anatomy of SDP trajectories. (a) Cascade depth. (b) Score rate vs. replan count. (c) Certified progress in failed runs. (d) VALIDATE calibration.
original abstract

Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires. No explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the State-Centric Decision Process (SDP), a runtime framework for language environments (e.g., browsers, code terminals) that lack explicit MDP structure. The agent commits to natural-language predicates describing desired world states, takes actions to satisfy them, and verifies observations against the predicates; passing predicates become certified states. This yields a task-induced state space, observation-to-state mapping, certified transitions, and termination criterion. SDP is evaluated on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop QA, claiming the best training-free results with the advantage widening as the horizon grows. The certified trajectories enable analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.

Significance. If the self-verification step is reliable, SDP offers a substantive contribution by retrofitting formal decision-process structure onto unstructured language agents without training. This could enable interpretable long-horizon reasoning and diagnostic tools in practical domains. The reported widening performance gap with horizon length, if substantiated with quantitative data, would strengthen the case for SDP over purely reactive baselines.

major comments (2)
  1. [SDP Construction] Predicate verification: The framework certifies states and transitions solely via the same language model verifying its own generated predicates against observations. No external oracle, ground-truth checker, or measured verification accuracy (e.g., false-positive rate on the NLI task) is described. This assumption is load-bearing for the 'certified' label, the constructed state space, termination criterion, and all downstream empirical claims and analyses.
  2. [Experimental Evaluation] The claim of achieving the best training-free results on all five benchmarks (with widening advantage as horizon grows) is presented without quantitative tables, specific metrics, error bars, baseline details, or any reporting on predicate verification accuracy in the evaluation. This prevents assessment of whether the superiority holds after accounting for potential verification errors.
minor comments (1)
  1. [Introduction] The paper would benefit from an early concrete example (e.g., one full predicate-verify cycle on a benchmark) to clarify the observation-to-state mapping and termination criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and empirical support.

point-by-point responses
  1. Referee: [SDP Construction] Predicate verification: The framework certifies states and transitions solely via the same language model verifying its own generated predicates against observations. No external oracle, ground-truth checker, or measured verification accuracy (e.g., false-positive rate on the NLI task) is described. This assumption is load-bearing for the 'certified' label, the constructed state space, termination criterion, and all downstream empirical claims and analyses.

    Authors: We agree that SDP relies on the language model's self-verification for certification, as no external oracle is available in the unstructured language environments we target. This is a deliberate design choice to enable structured decision processes without additional infrastructure. In the revision we will add a dedicated subsection (new Section 4.3) that empirically measures verification reliability: we will sample 200 predicate-observation pairs across the benchmarks, obtain human annotations for correctness, and report precision, recall, and false-positive rates. This directly quantifies the load-bearing assumption and supports the certified state space claims. revision: yes

  2. Referee: [Experimental Evaluation] The claim of achieving the best training-free results on all five benchmarks (with widening advantage as horizon grows) is presented without quantitative tables, specific metrics, error bars, baseline details, or any reporting on predicate verification accuracy in the evaluation. This prevents assessment of whether the superiority holds after accounting for potential verification errors.

    Authors: The full manuscript contains quantitative tables in Section 5 comparing SDP against training-free baselines on all five benchmarks, along with a plot (Figure 3) demonstrating the widening performance gap as horizon length increases. We will revise the experimental section to include error bars from five independent runs per benchmark, expanded baseline implementation details, and a new paragraph linking the verification accuracy results (from the added Section 4.3) to the main performance claims. This will allow readers to evaluate robustness after accounting for verification errors. revision: partial
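The reliability measurement proposed in response 1 is mechanically simple. A sketch, assuming the VALIDATE verdicts and the human annotations are collected as parallel boolean lists (the function name and representation are illustrative, not from the paper):

```python
def verification_metrics(predicted, annotated):
    """Precision, recall, and false-positive rate of VALIDATE verdicts
    against human annotations. `predicted` holds the model's pass/fail
    decisions on predicate-observation pairs; `annotated` the ground truth."""
    tp = sum(p and a for p, a in zip(predicted, annotated))
    fp = sum(p and not a for p, a in zip(predicted, annotated))
    fn = sum((not p) and a for p, a in zip(predicted, annotated))
    tn = sum((not p) and (not a) for p, a in zip(predicted, annotated))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0   # rate of spurious certifications
    return precision, recall, fpr
```

The false-positive rate is the quantity that bears most directly on the 'certified' label: a false positive is a state certified that the annotators judge unsatisfied.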

Circularity Check

0 steps flagged

No circularity: SDP is a runtime construction independent of fitted parameters or self-referential derivations

full rationale

The paper introduces SDP as a framework where an agent constructs state spaces, mappings, transitions, and termination criteria on the fly via natural-language predicates that are committed to and then checked against observations. No equations appear in the provided text that reduce any claimed result to a fitted input or self-definition. No self-citations are invoked to justify core premises, and the central claims rest on the runtime behavior rather than any parameter estimation or renaming of prior results. The derivation chain is self-contained against external benchmarks and does not reduce any prediction to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that natural-language predicates can serve as verifiable state descriptors in text environments. No free parameters or invented physical entities are introduced; the main addition is the SDP process itself.

axioms (1)
  • domain assumption: Natural-language predicates can be reliably committed to and verified against raw text observations by the agent
    The entire construction of certified states depends on accurate predicate evaluation without external ground truth.
invented entities (1)
  • State-Centric Decision Process (SDP) · no independent evidence
    purpose: To dynamically construct state space, transitions, and termination from text observations
    New runtime framework introduced to supply the four missing objects that language environments lack.

pith-pipeline@v0.9.0 · 5509 in / 1373 out tokens · 40942 ms · 2026-05-14T19:33:46.948475+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 7 internal anchors

  1. [1]

    State abstractions for lifelong reinforcement learning

    David Abel, Dilip Arumugam, Lucas Lehnert, and Michael Littman. State abstractions for lifelong reinforcement learning. In International Conference on Machine Learning, pages 10–19. PMLR, 2018

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  4. [4]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017

  5. [5]

    Agent context protocols enhance collective inference

    Devansh Bhardwaj, Arjun Beniwal, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R Narasimhan, Ameet Deshpande, and Vishvak Murahari. Agent context protocols enhance collective inference. arXiv preprint arXiv:2505.14569, 2025

  6. [6]

    Reinforcement learning for mapping instructions to actions

    Satchuthananthavale RK Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. InProceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90, 2009

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  8. [8]

    Enhancing LLM-based agents via global planning and hierarchical execution

    Junjie Chen, Haitao Li, Jingli Yang, Yiqun Liu, and Qingyao Ai. Enhancing llm-based agents via global planning and hierarchical execution.arXiv preprint arXiv:2504.16563, 2025

  9. [9]

    Atlas: Constraints-aware multi-agent collaboration for real-world travel planning

    Jihye Choi, Jinsung Yoon, Jiefeng Chen, Somesh Jha, and Tomas Pfister. Atlas: Constraints-aware multi-agent collaboration for real-world travel planning. arXiv preprint arXiv:2509.25586, 2025

  10. [10]

    ReasonPlanner: Enhancing autonomous planning in dynamic environments with temporal knowledge graphs and LLMs

    Minh Pham Dinh, Munira Syed, G Yankoski Michael, and Trenton W Ford. Reasonplanner: Enhancing autonomous planning in dynamic environments with temporal knowledge graphs and llms. arXiv preprint arXiv:2410.09252, 2024

  11. [11]

    Plan-and-Act: Improving planning of agents for long-horizon tasks

    Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572, 2025

  12. [12]

    Magentic-One: A generalist multi-agent system for solving complex tasks

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-One: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468, 2024

  13. [13]

    Equivalence notions and model minimization in Markov decision processes

    Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in markov decision processes.Artificial intelligence, 147(1-2):163–223, 2003

  14. [14]

    Mirror: Multi-agent intra- and inter-reflection for optimized reasoning in tool learning

    Zikang Guo, Benfeng Xu, Xiaorui Wang, and Zhendong Mao. Mirror: Multi-agent intra- and inter-reflection for optimized reasoning in tool learning. arXiv preprint arXiv:2505.20670, 2025

  15. [15]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023

  16. [16]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

  17. [17]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

  18. [18]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  19. [19]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

  20. [20]

    Shifting from ranking to set selection for retrieval augmented generation

    Dahyun Lee, Yongrae Jo, Haeju Park, and Moontae Lee. Shifting from ranking to set selection for retrieval augmented generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17606–17619, 2025

  21. [21]

    Towards a unified theory of state abstraction for MDPs

    Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps.AI&M, 1(2):3, 2006

  22. [22]

    Beyond entangled planning: Task-decoupled planning for long-horizon agents

    Yunfan Li, Bingbing Xu, Xueyun Tian, Xiucheng Xu, and Huawei Shen. Beyond entangled planning: Task-decoupled planning for long-horizon agents.arXiv preprint arXiv:2601.07577, 2026

  23. [23]

    SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks

    Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. Advances in Neural Information Processing Systems, 36:23813–23825, 2023

  24. [24]

    Positive experience reflection for agents in interactive text environments

    Philip Lippmann, Matthijs TJ Spaan, and Jie Yang. Positive experience reflection for agents in interactive text environments. InProceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 131–142, 2025

  25. [25]

    Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency

    Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency. arXiv preprint arXiv:2309.17382, 2023

  26. [26]

    Clin: A continually learning language agent for rapid task adaptation and generalization

    Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. Clin: A continually learning language agent for rapid task adaptation and generalization.arXiv preprint arXiv:2310.10134, 2023

  27. [27]

    A survey of POMDP solution techniques

    Kevin P Murphy et al. A survey of pomdp solution techniques.environment, 2(10):1–12, 2000

  28. [28]

    Prism: Agentic retrieval with LLMs for multi-hop question answering

    Md Mahadi Hasan Nahid and Davood Rafiei. Prism: Agentic retrieval with llms for multi-hop question answering.arXiv preprint arXiv:2510.14278, 2025

  29. [29]

    RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze! arXiv preprint arXiv:2312.02724, 2023

  30. [30]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

  31. [31]

    Markov decision processes: Discrete stochastic dynamic programming

    Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  32. [32]

    Infogent: An agent-based framework for web information aggregation

    Revanth Gangi Reddy, Sagnik Mukherjee, Jeonghwan Kim, Zhenhailong Wang, Dilek Hakkani-Tur, and Heng Ji. Infogent: An agent-based framework for web information aggregation. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 5745–5758, 2025

  33. [33]

    Universal value function approximators

    Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320. PMLR, 2015

  34. [34]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  35. [35]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  36. [36]

    Mastering the game of Go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge.nature, 550(7676):354–359, 2017

  37. [37]

    Cognitive architectures for language agents

    Theodore Sumers, Shunyu Yao, Karthik R Narasimhan, and Thomas L Griffiths. Cognitive architectures for language agents.Transactions on Machine Learning Research, 2023

  38. [38]

    AdaPlanner: Adaptive planning from feedback with language models

    Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from feedback with language models. Advances in Neural Information Processing Systems, 36:58202–58245, 2023

  39. [39]

    Is chatgpt good at search? investigating large language models as re-ranking agents

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is chatgpt good at search? investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14918–14937, 2023

  40. [40]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998

  41. [41]

    MuSiQue: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  42. [42]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 10014– 10037, 2023

  43. [43]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  44. [44]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  45. [45]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634, 2023

  46. [46]

    ScienceWorld: Is your agent smarter than a 5th grader?

    Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

  47. [47]

    Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents

    Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023

  48. [48]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  49. [49]

    Partially observable Markov decision processes for spoken dialog systems

    Jason D Williams and Steve Young. Partially observable markov decision processes for spoken dialog systems.Computer Speech & Language, 21(2):393–422, 2007

  50. [50]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992

  51. [51]

    The rise and potential of large language model based agents: A survey

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025

  52. [52]

    TravelPlanner: A benchmark for real-world planning with language agents

    Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents.arXiv preprint arXiv:2402.01622, 2024

  53. [53]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  54. [54]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

  55. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  56. [56]

    AssistantBench: Can web agents solve realistic and time-consuming tasks?

    Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8938–8968, 2024

  57. [57]

    POMDP-based statistical spoken dialog systems: A review

    Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013

  58. [58]

    Evoagent: Towards automatic multi-agent generation via evolutionary algorithms

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, and Deqing Yang. Evoagent: Towards automatic multi-agent generation via evolutionary algorithms. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6192–6217, 2025

  59. [59]

    Planning with multi-constraints via collaborative language agents

    Cong Zhang, Xin Deik Goh, Dexun Li, Hao Zhang, and Yong Liu. Planning with multi-constraints via collaborative language agents. InProceedings of the 31st International Conference on Computational Linguistics, pages 10054–10082, 2025

  60. [60]

    Expel: Llm agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

  61. [61]

    GPT-4V(ision) is a generalist web agent, if grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024

  62. [62]

    Language agent tree search unifies reasoning acting and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023

  63. [63]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023