{"paper":{"title":"Reasoning with Language Model is Planning with World Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Language models can reason better by using themselves as world models and planning with tree search.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Daisy Zhe Wang, Haodi Ma, Joshua Jiahua Hong, Shibo Hao, Yi Gu, Zhen Wang, Zhiting Hu","submitted_at":"2023-05-24T10:28:28Z","abstract_excerpt":"Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal $\\textit{world model}$ to predict the world $\\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the LLM, when prompted to act as world model, produces sufficiently accurate state predictions and transition simulations to guide search without compounding errors that invalidate the planning process.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Language models can reason better by using themselves as world models and planning with tree search.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"60587927e399c3a0400a8a9f486237cbcb4f99a44f4a472378d82856ff0484f2"},"source":{"id":"2305.14992","kind":"arxiv","version":2},"verdict":{"id":"c1417558-31a0-4e87-9ca9-aad60b332789","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T01:45:01.211709Z","strongest_claim":"RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.","one_line_summary":"RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the LLM, when prompted to act as world model, produces sufficiently accurate state predictions and transition simulations to guide search without compounding errors that invalidate the planning process.","pith_extraction_headline":"Language models can reason better by using themselves as world models and planning with tree search."},"references":{"count":134,"sample":[{"doi":"","year":1992,"title":"Alan Baddeley. 1992. Working memory. Science, 255(5044):556--559","work_id":"f36b7e9d-fb69-46ab-9df5-b0ea2ea4b066","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"Robert Eamon Briscoe. 2011. Mental imagery and the varieties of amodal perception. Pacific Philosophical Quarterly, 92(2):153--173","work_id":"6f8e00d0-b211-4677-8661-131bbfc2b45e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot lear","work_id":"50684699-ce18-4086-8bac-7cecd178fad0","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1994,"title":"Tom Bylander. 1994. The computational complexity of propositional strips planning. Artificial Intelligence, 69(1-2):165--204","work_id":"3be5a615-51c1-49c4-ab86-dff31f715a8e","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2013,"title":"Eduardo F Camacho and Carlos Bordons Alba. 2013. Model predictive control. Springer science & business media","work_id":"775ab905-fd2a-4ddd-821e-0fc8324fe7ed","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":134,"snapshot_sha256":"03cd7f9af3bd37b779e1d8c431df627824a113282ccd5043b090d5cdbc1ce01f","internal_anchors":31},"formal_canon":{"evidence_count":2,"snapshot_sha256":"65afd9f30e6abf4560f14233598d202f80622ee4c95e7e11b7b50b9f26131303"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}