Recognition: 2 theorem links · Lean theorem
Reasoning with Language Model is Planning with World Model
Pith reviewed 2026-05-17 01:45 UTC · model grok-4.3
The pith
Language models can reason better by using themselves as world models and planning with tree search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning with a language model is equivalent to planning with a world model; by repurposing the model to predict next states and rewards and embedding it inside a Monte Carlo Tree Search procedure, the system can systematically explore and refine reasoning sequences to reach higher-reward solutions for complex tasks.
What carries the argument
RAP (Reasoning via Planning), which has the language model simulate state transitions as a world model and build a reasoning tree as an agent under the direction of Monte Carlo Tree Search and task-specific rewards.
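The agent/world-model split described above can be sketched as a standard MCTS loop in which the same model would be called twice per step. The sketch below substitutes a toy deterministic "world" (a running total, actions that add numbers) for the LLM so it is runnable; the function names `propose_actions` and `predict_next_state` are illustrative stand-ins, not the paper's API.

```python
import math
import random

TARGET = 10  # toy goal: reach this running total

def propose_actions(state):             # stand-in for the LLM as agent
    return [1, 2, 3]

def predict_next_state(state, action):  # stand-in for the LLM as world model
    return state + action

def reward(state):                      # task-specific reward
    return 1.0 if state == TARGET else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct_select(node, c=1.4):
    # Pick the child maximizing value estimate plus exploration bonus.
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def mcts(root_state, iterations=200, horizon=6):
    root = Node(root_state)
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: descend through fully expanded nodes by UCT.
        while node.children and len(node.children) == len(propose_actions(node.state)):
            node, depth = uct_select(node), depth + 1
        # Expansion: simulate one untried action with the world model.
        untried = [a for a in propose_actions(node.state) if a not in node.children]
        if untried and depth < horizon:
            a = random.choice(untried)
            node.children[a] = Node(predict_next_state(node.state, a), node)
            node = node.children[a]
        # Evaluation and backpropagation.
        r = reward(node.state)
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    # Return the most-visited first action.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

random.seed(0)
best = mcts(0)
print(best)
```

In RAP the expansion step is where the model plays both roles: the agent samples candidate actions, and the world model predicts the state each action leads to, so the tree is built entirely from simulated transitions.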
If this is right
- RAP produces higher-quality action plans and solutions than chain-of-thought or least-to-most prompting with self-consistency on plan generation, math reasoning, and logical inference.
- The model can explore alternative reasoning paths and anticipate future states instead of committing to a single linear chain.
- Task-specific rewards combined with simulated outcomes allow efficient search that balances exploration and exploitation.
- Adding the planning mechanism lets a model outperform larger models prompted without it.
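The exploration-exploitation balance referred to above is, in standard MCTS, governed by the UCT selection rule (the paper uses a variant; the symbols here follow common convention rather than the paper's notation):

```latex
\mathrm{UCT}(s, a) \;=\; Q(s, a) \;+\; w \sqrt{\frac{\ln N(s)}{N(s, a)}}
```

where $Q(s,a)$ is the estimated value of taking action $a$ in state $s$ (exploitation), $N(s)$ and $N(s,a)$ are visit counts, and the weight $w$ scales the exploration bonus toward rarely tried actions.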
Where Pith is reading between the lines
- The approach could extend to interactive settings such as game playing or robotic control where accurate internal simulation reduces reliance on external feedback.
- Combining the method with external tools or fine-tuning for better state prediction might further limit compounding errors over long horizons.
- Similar tree-search structures might improve other generative tasks that benefit from lookahead, such as code synthesis or multi-turn dialogue planning.
Load-bearing premise
The language model's predictions of future states and action outcomes must remain accurate enough that simulation errors do not accumulate and invalidate the planning search.
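The accumulation concern is easy to quantify under a simple independence assumption: if each simulated transition is correct with probability p, a depth-d rollout is fully faithful only with probability p^d. The numbers below are illustrative, not measurements from the paper.

```python
# Probability that an entire rollout of depth d is faithful, assuming each
# step is independently correct with probability p (illustrative values).
p = 0.95
for d in (1, 4, 8, 16):
    print(d, p ** d)
```

Even a 95%-accurate world model keeps only about two-thirds of 8-step rollouts error-free, which is why per-step fidelity is load-bearing for the search.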
What would settle it
A controlled test on a multi-step math or planning task in which the model's state predictions diverge from ground truth after only a few steps, causing the search to select a low-quality or invalid reasoning path that standard prompting would have avoided.
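The settling experiment amounts to simple bookkeeping: step the model's predicted state and the ground-truth state forward together and record the first index where they disagree. In this sketch `predict_state` is a hypothetical stand-in for the LLM world model that injects a drift error at step 3 to exercise the comparison.

```python
def ground_truth(state, action):
    # Oracle transition for the toy task: a running total.
    return state + action

def predict_state(state, action, step):
    # Hypothetical world-model stand-in; injects an error at step 3
    # to simulate drift from ground truth.
    return state + action + (1 if step == 3 else 0)

def first_divergence(actions):
    # Return the first step at which predicted and true states disagree,
    # or None if the rollout stays faithful.
    true_s = pred_s = 0
    for step, a in enumerate(actions):
        true_s = ground_truth(true_s, a)
        pred_s = predict_state(pred_s, a, step)
        if pred_s != true_s:
            return step
    return None

print(first_divergence([2, 2, 2, 2, 2]))  # → 3
```

Run over a benchmark, the distribution of first-divergence depths would directly test whether the search operates on faithful or corrupted state estimates.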
read the original abstract
Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, $\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration $\textit{vs.}$ exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Reasoning via Planning (RAP), a framework that repurposes an LLM as both a reasoning agent and a world model, then integrates it with Monte Carlo Tree Search (MCTS) to explore reasoning paths guided by simulated state transitions and task-specific rewards. It evaluates the approach on plan generation, mathematical reasoning, and logical inference tasks, reporting that RAP instantiated with LLaMA-33B outperforms Chain-of-Thought prompting on GPT-4 with a 33% relative improvement on plan generation.
Significance. If the LLM-as-world-model component produces sufficiently accurate long-horizon state predictions, the work would demonstrate a concrete way to augment LLM reasoning with explicit planning, potentially improving performance on tasks requiring anticipation of future states. The use of a standard, off-the-shelf planning algorithm (MCTS) with separable reward signals is a methodological strength that keeps the contribution focused on the LLM simulation interface rather than algorithmic novelty.
major comments (2)
- [Abstract and §3] Abstract and §3 (RAP framework): the central claim that RAP enables 'deliberate planning' and yields the reported gains rests on the untested assumption that the LLM, when prompted as world model, produces next-state predictions accurate enough to guide MCTS without compounding errors. No quantitative measurement of world-model fidelity (e.g., next-state prediction accuracy or rollout error against ground-truth transitions on the evaluation tasks) is provided, which is load-bearing for interpreting the 33% relative improvement as evidence of principled planning rather than noisy search.
- [Experimental results] Experimental results section (plan-generation setting): the headline comparison (RAP on LLaMA-33B vs. CoT on GPT-4) reports no error bars, confidence intervals, or details on experimental controls such as prompt formatting, decoding parameters, or number of MCTS simulations. Without these, it is impossible to determine whether the observed difference is robust or sensitive to implementation choices.
minor comments (1)
- [§3.2] Notation for the world-model prompt template is introduced without a clear example or pseudocode, making it difficult to reproduce the exact simulation interface used in the MCTS rollouts.
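The kind of concrete example the referee asks for might look like the following. The wording and field names are hypothetical illustrations of a RAP-style simulation interface, not the paper's actual template.

```python
# Hypothetical world-model prompt template (illustrative, not the paper's).
# At each MCTS expansion, the current state and candidate action are filled
# in and the model's completion is parsed as the predicted next state.
WORLD_MODEL_PROMPT = """You track the state of a blocks-world task.
Current state: {state}
Action taken: {action}
Describe the resulting state in one line."""

prompt = WORLD_MODEL_PROMPT.format(
    state="block A on table, block B on A",
    action="move B to table",
)
print(prompt)
```

Pseudocode or a template at this level of detail would make the simulation interface used in the rollouts reproducible.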
Simulated Author's Rebuttal
We thank the referee for the insightful comments on the RAP framework and experimental reporting. We address each major point below with proposed revisions to strengthen the manuscript.
read point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (RAP framework): the central claim that RAP enables 'deliberate planning' and yields the reported gains rests on the untested assumption that the LLM, when prompted as world model, produces next-state predictions accurate enough to guide MCTS without compounding errors. No quantitative measurement of world-model fidelity (e.g., next-state prediction accuracy or rollout error against ground-truth transitions on the evaluation tasks) is provided, which is load-bearing for interpreting the 33% relative improvement as evidence of principled planning rather than noisy search.
Authors: We agree that explicit quantification of world-model accuracy would aid interpretation. In the original manuscript, we prioritized end-task performance as the primary evidence, since ground-truth state transitions are not explicitly annotated in the plan-generation and logical-inference benchmarks. The consistent gains over strong baselines (including GPT-4 CoT) and the use of task-specific rewards provide indirect support that the simulated transitions are useful. In revision we will add a new subsection in §3 discussing potential error accumulation, include qualitative rollout examples in the appendix, and report a simple next-state prediction accuracy metric on the math-reasoning tasks where intermediate variables offer clearer ground truth. revision: partial
Referee: [Experimental results] Experimental results section (plan-generation setting): the headline comparison (RAP on LLaMA-33B vs. CoT on GPT-4) reports no error bars, confidence intervals, or details on experimental controls such as prompt formatting, decoding parameters, or number of MCTS simulations. Without these, it is impossible to determine whether the observed difference is robust or sensitive to implementation choices.
Authors: We accept this criticism. The revised manuscript will report standard deviations across three random seeds for the plan-generation results, include 95% confidence intervals, and add a dedicated “Implementation Details” paragraph specifying the number of MCTS simulations (100), prompt templates, decoding parameters (temperature 0.7, top-p 0.9), and stopping criteria. These additions will appear in the experimental setup and results sections. revision: yes
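The promised seed-level confidence intervals are a small computation; a minimal sketch, with made-up per-seed scores standing in for the actual results:

```python
import statistics as st

# Per-seed task accuracies (illustrative values, not the paper's numbers).
scores = [0.64, 0.61, 0.66]

mean = st.mean(scores)
sd = st.stdev(scores)          # sample standard deviation (n - 1 divisor)
# t-based 95% CI with n = 3 seeds: critical value t_{0.975, df=2} = 4.303.
half_width = 4.303 * sd / (len(scores) ** 0.5)
print(f"{mean:.3f} ± {half_width:.3f}")
```

With only three seeds, the t critical value is large, so the interval is wide; reporting it makes that uncertainty visible rather than hiding it behind a point estimate.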
Circularity Check
No circularity: the RAP framework is a procedural combination of standard MCTS with LLM prompting, independent of its inputs.
full rationale
The paper introduces RAP as an algorithmic framework that repurposes an LLM for both agent and world-model roles inside a Monte Carlo Tree Search loop, with task-specific rewards. No equations or derivations reduce a claimed prediction back to a fitted parameter or self-citation by construction. The planning procedure, tree expansion, and selection steps are described as standard MCTS operations applied to LLM-generated text; they do not presuppose the final performance numbers. Empirical results on plan generation, math, and logic tasks are presented as external measurements rather than tautological outputs. The design choice to use the same LLM for simulation is separable from the algorithmic contribution and does not create a self-definitional loop. This is the common case of an honest non-finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: an LLM prompted as a world model yields state predictions and action outcomes accurate enough to guide planning.
Forward citations
Cited by 17 Pith papers
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
  Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
- Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
  CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
- Training Large Language Models to Reason in a Continuous Latent Space
  Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...
- AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
  A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
- STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
  STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
- Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
  Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision
  OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
- Cognitive Architectures for Language Agents
  CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
- A Survey on Large Language Model based Autonomous Agents
  A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
  The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.
- NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
  Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
- Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning
  A case-based learning framework extracts reusable knowledge from past tasks to improve LLM agents' structured performance on complex real-world tasks, outperforming standard prompting baselines especially as task comp...
- Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
  Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.
- Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
  An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.
- Understanding the planning of LLM agents: A survey
  A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...