pith. machine review for the scientific record.

arxiv: 2305.14992 · v2 · submitted 2023-05-24 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Reasoning with Language Model is Planning with World Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords large language models · reasoning · planning · world model · monte carlo tree search · chain of thought · action planning

The pith

Language models can reason better by using themselves as world models and planning with tree search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that language models struggle with planning because they lack an internal world model to predict how states evolve after actions or to simulate long-term outcomes. To address this, the authors introduce a framework that prompts the same model to act as a world simulator while also serving as a reasoning agent that explores paths via Monte Carlo Tree Search. The search balances exploration of alternatives against exploitation of promising steps, guided by task rewards and simulated states. This produces stronger results than chain-of-thought prompting on plan generation, math problems, and logical inference. One reported outcome is that RAP on the 33-billion-parameter LLaMA surpasses chain-of-thought on GPT-4 with a 33 percent relative improvement in a plan-generation setting.

Core claim

Reasoning with a language model is equivalent to planning with a world model; by repurposing the model to predict next states and rewards and embedding it inside a Monte Carlo Tree Search procedure, the system can systematically explore and refine reasoning sequences to reach higher-reward solutions for complex tasks.

What carries the argument

RAP (Reasoning via Planning), which has the language model simulate state transitions as a world model and build a reasoning tree as an agent under the direction of Monte Carlo Tree Search and task-specific rewards.
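
The loop this describes can be sketched compactly. Below is a minimal, generic MCTS in which `propose_actions` stands in for the LLM-as-agent and `predict_next_state` for the LLM-as-world-model; both are toy stand-ins on a counting task so the code runs, and none of the function names, prompts, or parameters are taken from the paper.

```python
import math
import random

def propose_actions(state):
    # Agent role: candidate next reasoning steps (an LLM call in RAP).
    return [1, 2, 3]

def predict_next_state(state, action):
    # World-model role: simulated next state (an LLM call in RAP).
    return state + action

def reward(state, goal=10):
    # Task-specific reward: closer to the goal is better.
    return -abs(goal - state)

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0

def uct_select(node, c=1.4):
    # Exploitation (mean value) plus an exploration bonus for
    # rarely visited children.
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def mcts(root_state, n_sims=200, horizon=4):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # Selection: descend while every action at the node is expanded.
        while node.children and all(
            a in node.children for a in propose_actions(node.state)
        ):
            node = uct_select(node)
        # Expansion: add one untried action via the simulated transition.
        untried = [a for a in propose_actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(predict_next_state(node.state, a), parent=node)
            node = node.children[a]
        # Rollout: a short simulated continuation, scored by the reward.
        state = node.state
        for _ in range(horizon):
            state = predict_next_state(state, random.choice(propose_actions(state)))
        r = reward(state)
        # Backpropagation: credit the whole path.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # The most-visited first step is the chosen reasoning action.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

random.seed(0)  # reproducible toy run
best_first_step = mcts(0)
```

Swapping the two stand-in functions for prompted LLM calls, and the reward for a task-specific signal, recovers the overall shape of the framework.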

If this is right

  • RAP produces higher-quality action plans and solutions than chain-of-thought or least-to-most prompting with self-consistency on plan generation, math reasoning, and logical inference.
  • The model can explore alternative reasoning paths and anticipate future states instead of committing to a single linear chain.
  • Task-specific rewards combined with simulated outcomes allow efficient search that balances exploration and exploitation.
  • The same model size can achieve better performance than larger models when the planning mechanism is added.
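
The exploration-exploitation balance in the third bullet is, in standard MCTS, the UCT selection rule. A generic statement, using the usual symbols rather than the paper's notation:

```latex
a^{*} \;=\; \arg\max_{a \in A(s)} \left[ Q(s, a) \;+\; w \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]
```

Here \(Q(s,a)\) is the mean reward observed after taking action \(a\) in state \(s\), \(N(s)\) and \(N(s,a)\) are visit counts, and the weight \(w\) trades off exploiting high-value steps against exploring rarely tried ones.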

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to interactive settings such as game playing or robotic control where accurate internal simulation reduces reliance on external feedback.
  • Combining the method with external tools or fine-tuning for better state prediction might further limit compounding errors over long horizons.
  • Similar tree-search structures might improve other generative tasks that benefit from lookahead, such as code synthesis or multi-turn dialogue planning.

Load-bearing premise

The language model's predictions of future states and action outcomes must remain accurate enough that simulation errors do not accumulate and invalidate the planning search.
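
This premise has a simple quantitative face: if each simulated transition is independently correct with probability p, a rollout of T steps is fully correct with probability p to the power T, so even high per-step accuracy decays quickly with depth. A minimal illustration, with hypothetical probabilities rather than anything measured in the paper:

```python
def rollout_fidelity(p, steps):
    """Probability that a rollout of `steps` simulated transitions is
    fully correct, assuming each step is independently correct with
    probability p (an illustrative independence assumption)."""
    return p ** steps

# Hypothetical per-step accuracies, not measurements from the paper.
for p in (0.99, 0.95, 0.90):
    fidelities = [rollout_fidelity(p, t) for t in (1, 5, 10)]
    print(p, [round(f, 3) for f in fidelities])
```

Under this toy model, 90 percent per-step accuracy leaves roughly a third of ten-step rollouts intact, which is why the search's reward signal has to tolerate, or correct for, simulation noise.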

What would settle it

A controlled test on a multi-step math or planning task in which the model's state predictions diverge from ground truth after only a few steps, causing the search to select a low-quality or invalid reasoning path that standard prompting would have avoided.

read the original abstract

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, $\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration $\textit{vs.}$ exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Reasoning via Planning (RAP), a framework that repurposes an LLM as both a reasoning agent and a world model, then integrates it with Monte Carlo Tree Search (MCTS) to explore reasoning paths guided by simulated state transitions and task-specific rewards. It evaluates the approach on plan generation, mathematical reasoning, and logical inference tasks, reporting that RAP instantiated with LLaMA-33B outperforms Chain-of-Thought prompting with GPT-4 by a 33% relative margin on plan generation.

Significance. If the LLM-as-world-model component produces sufficiently accurate long-horizon state predictions, the work would demonstrate a concrete way to augment LLM reasoning with explicit planning, potentially improving performance on tasks requiring anticipation of future states. The use of a standard, off-the-shelf planning algorithm (MCTS) with separable reward signals is a methodological strength that keeps the contribution focused on the LLM simulation interface rather than algorithmic novelty.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (RAP framework): the central claim that RAP enables 'deliberate planning' and yields the reported gains rests on the untested assumption that the LLM, when prompted as world model, produces next-state predictions accurate enough to guide MCTS without compounding errors. No quantitative measurement of world-model fidelity (e.g., next-state prediction accuracy or rollout error against ground-truth transitions on the evaluation tasks) is provided, which is load-bearing for interpreting the 33% relative improvement as evidence of principled planning rather than noisy search.
  2. [Experimental results] Experimental results section (plan-generation setting): the headline comparison (RAP on LLaMA-33B vs. CoT on GPT-4) reports no error bars, confidence intervals, or details on experimental controls such as prompt formatting, decoding parameters, or number of MCTS simulations. Without these, it is impossible to determine whether the observed difference is robust or sensitive to implementation choices.
minor comments (1)
  1. [§3.2] Notation for the world-model prompt template is introduced without a clear example or pseudocode, making it difficult to reproduce the exact simulation interface used in the MCTS rollouts.
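
For concreteness, an interface of the kind this comment asks for could look like the sketch below. The template wording and function names are invented for illustration; they are not the paper's actual prompts.

```python
# Hypothetical world-model prompt template; the wording is invented
# for illustration and is not taken from the paper.
WORLD_MODEL_TEMPLATE = (
    "You are simulating the environment for a reasoning task.\n"
    "Current state: {state}\n"
    "Action: {action}\n"
    "Next state:"
)

def world_model_prompt(state, action):
    """Render the prompt an MCTS rollout would send to the LLM in its
    world-model role to obtain the simulated next state."""
    return WORLD_MODEL_TEMPLATE.format(state=state, action=action)

prompt = world_model_prompt("block A on block B", "unstack A from B")
```

Pinning down even this much in the manuscript (the exact fields, their serialization, and how the completion is parsed back into a state) would make the simulation interface reproducible.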

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on the RAP framework and experimental reporting. We address each major point below with proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (RAP framework): the central claim that RAP enables 'deliberate planning' and yields the reported gains rests on the untested assumption that the LLM, when prompted as world model, produces next-state predictions accurate enough to guide MCTS without compounding errors. No quantitative measurement of world-model fidelity (e.g., next-state prediction accuracy or rollout error against ground-truth transitions on the evaluation tasks) is provided, which is load-bearing for interpreting the 33% relative improvement as evidence of principled planning rather than noisy search.

    Authors: We agree that explicit quantification of world-model accuracy would aid interpretation. In the original manuscript, we prioritized end-task performance as the primary evidence, since ground-truth state transitions are not explicitly annotated in the plan-generation and logical-inference benchmarks. The consistent gains over strong baselines (including GPT-4 CoT) and the use of task-specific rewards provide indirect support that the simulated transitions are useful. In revision we will add a new subsection in §3 discussing potential error accumulation, include qualitative rollout examples in the appendix, and report a simple next-state prediction accuracy metric on the math-reasoning tasks where intermediate variables offer clearer ground truth. revision: partial

  2. Referee: [Experimental results] Experimental results section (plan-generation setting): the headline comparison (RAP on LLaMA-33B vs. CoT on GPT-4) reports no error bars, confidence intervals, or details on experimental controls such as prompt formatting, decoding parameters, or number of MCTS simulations. Without these, it is impossible to determine whether the observed difference is robust or sensitive to implementation choices.

    Authors: We accept this criticism. The revised manuscript will report standard deviations across three random seeds for the plan-generation results, include 95% confidence intervals, and add a dedicated “Implementation Details” paragraph specifying the number of MCTS simulations (100), prompt templates, decoding parameters (temperature 0.7, top-p 0.9), and stopping criteria. These additions will appear in the experimental setup and results sections. revision: yes
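
The reporting promised here can be sketched directly: mean, sample standard deviation, and a two-sided 95% t-interval over per-seed scores. The scores below are placeholders, not results from the paper.

```python
import math
import statistics

def t_interval_95(scores):
    """Mean, sample standard deviation, and two-sided 95% t-interval
    for a small set of per-seed scores (t critical values hard-coded
    for 1 to 4 degrees of freedom)."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (ddof=1)
    t_crit = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}[n - 1]
    half = t_crit * sd / math.sqrt(n)
    return mean, sd, (mean - half, mean + half)

# Placeholder per-seed success rates, not numbers from the paper.
mean, sd, (lo, hi) = t_interval_95([0.61, 0.64, 0.59])
```

With only three seeds the t critical value (4.303) is large, so the interval is wide; that width is itself informative when comparing against a single-run baseline number.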

Circularity Check

0 steps flagged

No circularity: RAP framework is a procedural combination of standard MCTS with LLM prompting, independent of its inputs

full rationale

The paper introduces RAP as an algorithmic framework that repurposes an LLM for both agent and world-model roles inside a Monte Carlo Tree Search loop, with task-specific rewards. No equations or derivations reduce a claimed prediction back to a fitted parameter or self-citation by construction. The planning procedure, tree expansion, and selection steps are described as standard MCTS operations applied to LLM-generated text; they do not presuppose the final performance numbers. Empirical results on plan generation, math, and logic tasks are presented as external measurements rather than tautological outputs. The design choice to use the same LLM for simulation is separable from the algorithmic contribution and does not create a self-definitional loop. This is the common case of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that an LLM can be prompted to produce usable state predictions and transition functions; no free parameters or invented entities are declared in the abstract.

axioms (1)
  • domain assumption An LLM prompted as world model yields sufficiently accurate state predictions and action outcomes for planning guidance
    Invoked as the core justification for repurposing the LLM; appears in the problem statement and method description.

pith-pipeline@v0.9.0 · 5643 in / 1182 out tokens · 64493 ms · 2026-05-17T01:45:01.211709+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  2. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  3. Training Large Language Models to Reason in a Continuous Latent Space

    cs.CL 2024-12 unverdicted novelty 7.0

    Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...

  4. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  5. STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

    cs.CL 2026-05 unverdicted novelty 6.0

    STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

  6. Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

  7. Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    cs.CL 2024-06 conditional novelty 6.0

    OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

  8. Cognitive Architectures for Language Agents

    cs.AI 2023-09 accept novelty 6.0

    CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...

  9. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  10. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

    cs.AI 2026-05 conditional novelty 5.0

    The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

  11. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  12. Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    A case-based learning framework extracts reusable knowledge from past tasks to improve LLM agents' structured performance on complex real-world tasks, outperforming standard prompting baselines especially as task comp...

  13. Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

    cs.CL 2026-03 unverdicted novelty 5.0

    Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.

  14. Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol

    cs.DC 2026-03 unverdicted novelty 5.0

    An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.

  15. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  16. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  17. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

134 extracted references · 134 canonical work pages · cited by 17 Pith papers · 31 internal anchors

  1. [1]

    Alan Baddeley. 1992. Working memory. Science, 255(5044):556--559

  2. [2]

    Robert Eamon Briscoe. 2011. Mental imagery and the varieties of amodal perception. Pacific Philosophical Quarterly, 92(2):153--173

  3. [3]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  4. [5]

    Tom Bylander. 1994. The computational complexity of propositional strips planning. Artificial Intelligence, 69(1-2):165--204

  5. [6]

    Eduardo F Camacho and Carlos Bordons Alba. 2013. Model predictive control. Springer science & business media

  6. [9]

    Rémi Coulom. 2007. Efficient selectivity and backup operators in monte-carlo tree search. In Computers and Games: 5th International Conference, CG 2006, Turin, Italy, May 29-31, 2006. Revised Papers 5, pages 72--83. Springer

  7. [11]

    Wojciech W Gasparski and Tufan Orel. 2014. Designology: Studies on Planning for Action, volume 1. Transaction Publishers

  8. [12]

    Dedre Gentner and Albert L Stevens. 2014. Mental models. Psychology Press

  9. [13]

    David Ha and Jürgen Schmidhuber. 2018a. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31

  10. [17]

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023a. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36

  11. [18]

    Shibo Hao, Bowen Tan, Kaiwen Tang, Bin Ni, Xiyan Shao, Hengzhe Zhang, Eric Xing, and Zhiting Hu. 2023b. Bertnet: Harvesting knowledge graphs with arbitrary relations from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5000--5015

  12. [19]

    Mark K Ho, David Abel, Carlos G Correa, Michael L Littman, Jonathan D Cohen, and Thomas L Griffiths. 2021. Control of mental representations in human planning. arXiv e-prints, pages arXiv--2105

  13. [22]

    Quentin JM Huys, Neir Eshel, Elizabeth O'Nions, Luke Sheridan, Peter Dayan, and Jonathan P Roiser. 2012. Bonsai trees in your head: how the pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS computational biology, 8(3):e1002410

  14. [23]

    Yu-qian Jiang, Shi-qi Zhang, Piyush Khandelwal, and Peter Stone. 2019. Task planning in robotics: an empirical comparison of pddl-and asp-based systems. Frontiers of Information Technology & Electronic Engineering, 20:363--373

  15. [24]

    Philip N Johnson-Laird. 2010. Mental models and human reasoning. Proceedings of the National Academy of Sciences, 107(43):18243--18250

  16. [25]

    Philip Nicholas Johnson-Laird. 1983. Mental models: Towards a cognitive science of language, inference, and consciousness. 6. Harvard University Press

  17. [27]

    Levente Kocsis and Csaba Szepesvári. 2006. Bandit based monte-carlo planning. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings 17, pages 282--293. Springer

  18. [29]

    Yann LeCun. 2022. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62

  19. [33]

    Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. 2022. Deep learning, reinforcement learning, and world models. Neural Networks

  20. [34]

    John McCarthy. 1963. Situations, actions, and causal laws. Technical report, STANFORD UNIV CA DEPT OF COMPUTER SCIENCE

  21. [36]

    OpenAI. 2023. GPT-4 technical report. http://arxiv.org/abs/2303.08774

  22. [38]

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. 2018. VirtualHome: Simulating household activities via programs. http://arxiv.org/abs/1806.07011

  23. [41]

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. 2020. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604--609

  24. [42]

    Jay Schulkin. 2012. Action, perception and the brain: Adaptation and cephalic expression. Springer

  25. [43]

    Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. 2020. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pages 8583--8592. PMLR

  26. [44]

    Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. ArXiv, abs/2303.11366

  27. [47]

    Edward C Tolman. 1948. Cognitive maps in rats and men. Psychological review, 55(4):189

  28. [55]

    Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. 2023. Daydreamer: World models for physical robot learning. In Conference on Robot Learning, pages 2226--2240. PMLR

  29. [56]

    Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. 2023. Language models meet world models: Embodied experiences enhance language models. Advances in neural information processing systems, 36

  30. [60]

    Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  31. [61]

    A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review

  32. [62]

    A diverse corpus for evaluating and developing English math word problem solvers. arXiv preprint arXiv:2106.15772, 2021

  33. [63]

    Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  34. [64]

    ProofWriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048, 2020

  35. [65]

    FOLIO: Natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840, 2022

  36. [66]

    Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022

  37. [67]

    GPT-4 technical report, 2023

  38. [68]

    LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  39. [69]

    Language models are few-shot learners. Advances in neural information processing systems

  40. [70]

    PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  41. [71]

    Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023

  42. [72]

    Attention is all you need. Advances in neural information processing systems

  43. [73]

    Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022

  44. [74]

    World models. arXiv preprint arXiv:1803.10122, 2018

  45. [75]

    Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  46. [76]

    Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022

  47. [77]

    Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  48. [78]

    Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  49. [79]

    Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 2021

  50. [80]

    Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178, 2021

  51. [81]

    G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

  52. [82]

    Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177, 2017

  53. [83]

    A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012

  54. [84]

    On the planning abilities of large language models (a critical investigation with a proposed benchmark). arXiv preprint arXiv:2302.06706, 2023

  55. [85]

    Thinking, fast and slow, 2011

  56. [86]

    The nature of explanation, 1967

  57. [87]

    Applied optimal control: optimization, estimation and control, 1975

  58. [88]

    Machine translation decoding beyond beam search. arXiv preprint arXiv:2104.05336, 2021

  59. [89]

    Language modeling with latent situations. arXiv preprint arXiv:2212.10012, 2022

  60. [90]

    Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022

  61. [91]

    AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems

  62. [92]

    SeqZero: Few-shot compositional semantic parsing with sequential prompts and zero-shot models. arXiv preprint arXiv:2205.07381, 2022

  63. [93]

    Iteratively prompt pre-trained language models for chain of thought. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

  64. [94]

    Enhancing chain-of-thoughts prompting with iterative bootstrapping in large language models. arXiv preprint arXiv:2304.11657, 2023

  65. [95]

    REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023

  66. [96]

    Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053, 2022

  67. [97]

    Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633, 2023

  68. [98]

    Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  69. [99]

    Learning a world model and planning with a self-organizing, dynamic neural system. Advances in neural information processing systems

  70. [100]

    Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257, 2022

  71. [101]

    Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  72. [102]

    Plan4MC: Skill reinforcement learning and planning for open-world Minecraft tasks. arXiv preprint arXiv:2303.16563, 2023

  73. [103]

    Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023

  74. [104]

    Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379, 2023

  75. [105]

    LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023

  76. [106]

    Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022

  77. [107]

    HyperTree proof search for neural theorem proving. Advances in Neural Information Processing Systems

  78. [108]

    The Winograd schema challenge. Thirteenth international conference on the principles of knowledge representation and reasoning

  79. [109]

    Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 2015

  80. [110]

    Programs with common sense, 1959

Showing first 80 references.