Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
Pith reviewed 2026-05-16 08:40 UTC · model grok-4.3
The pith
Plan-and-solve prompting divides tasks into subtasks before solving them to cut missing-step errors in zero-shot chain-of-thought reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that zero-shot prompting benefits from an explicit two-stage process: first generating a plan that decomposes the overall reasoning problem into ordered subtasks, then executing those subtasks sequentially according to the plan. This structure directly targets missing-step errors and, when augmented with instructions on calculation and semantic accuracy, yields higher final-answer accuracy across multiple reasoning benchmarks.
What carries the argument
The Plan-and-Solve (PS) prompting template, which instructs the model to first output a plan and then solve according to it; the PS+ variant adds explicit guidance on avoiding calculation and misunderstanding errors.
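As a concrete sketch of the template mechanics, the Python below contrasts the Zero-shot-CoT trigger with the PS and PS+ triggers in a single completion call. The PS+ wording follows the paper's own template; `llm_complete` is a hypothetical stand-in for whatever completion client is used, not an API from the released code.

```python
def llm_complete(prompt: str, **decode_kwargs) -> str:
    """Hypothetical stand-in for an LLM completion API call."""
    raise NotImplementedError

# Baseline trigger used by Zero-shot-CoT.
ZERO_SHOT_COT_TRIGGER = "Let's think step by step."

# PS: devise a plan first, then execute it (wording follows the paper's template).
PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

# PS+: adds explicit instructions on variable extraction and calculation
# accuracy (wording follows the paper's template).
PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and their "
    "corresponding numerals, and devise a complete plan. Then, let's carry out "
    "the plan, calculate intermediate variables (pay attention to correct "
    "numerical calculation and commonsense), solve the problem step by step, "
    "and show the answer."
)

def plan_and_solve(question: str, trigger: str = PS_PLUS_TRIGGER) -> str:
    # A single zero-shot call: the trigger induces the model to emit a plan
    # followed by its step-by-step execution, with no hand-crafted examples.
    return llm_complete(f"Q: {question}\nA: {trigger}")
```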
If this is right
- Zero-shot reasoning can approach the accuracy of few-shot chain-of-thought without any hand-crafted examples.
- The method reduces missing-step errors across arithmetic, commonsense, and symbolic reasoning tasks.
- Adding plan generation as an intermediate step raises final-answer accuracy on GPT-3 models by a consistent margin.
- Performance remains competitive with program-of-thought baselines while staying simpler to implement.
Where Pith is reading between the lines
- The same explicit decomposition step could be inserted into other zero-shot strategies such as tree-of-thoughts to organize branching exploration.
- Models appear to hold latent planning ability that surface-level step-by-step prompts do not activate.
- Splitting plan generation and plan execution into separate model calls, rather than a single prompt, might further improve results and is worth testing.
- The technique may transfer to long-horizon domains such as code generation or multi-stage scientific reasoning.
Load-bearing premise
The performance gains arise specifically from the plan-then-execute ordering rather than from prompt length or added instructions alone.
What would settle it
A control prompt that matches PS in length and instruction detail but removes the explicit plan-generation step and shows no accuracy improvement would falsify the central claim.
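A minimal sketch of that control, reusing the hypothetical `llm_complete` stand-in from above and assuming a dataset of (question, gold answer) pairs plus an `extract_answer` parser (both hypothetical here). The control trigger's wording is illustrative, not from the paper: it roughly matches PS in length and instructional detail but omits the explicit plan-generation step, so a persistent gap would support the central claim and a vanishing gap would falsify it.

```python
# Length-matched ablation: identical pipeline, only the trigger differs.
# `llm_complete` and `extract_answer` are hypothetical stand-ins.

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

# Control: roughly matched in token count and instruction density, but with
# no explicit planning step (illustrative wording, not from the paper).
CONTROL_TRIGGER = (
    "Let's first read the problem carefully and note every given quantity. "
    "Then, let's work through the problem and solve it carefully step by step."
)

def accuracy(trigger: str, dataset: list[tuple[str, str]]) -> float:
    correct = 0
    for question, gold in dataset:
        output = llm_complete(f"Q: {question}\nA: {trigger}")
        correct += int(extract_answer(output) == gold)
    return correct / len(dataset)

# Falsification condition: if accuracy(PS_TRIGGER, data) shows no lift over
# accuracy(CONTROL_TRIGGER, data), the gains came from prompt length and
# instruction density, not from the plan-then-execute ordering.
```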
read the original abstract
Large language models (LLMs) have recently been shown to deliver impressive performance in various NLP tasks. To tackle multi-step reasoning tasks, few-shot chain-of-thought (CoT) prompting includes a few manually crafted step-by-step reasoning demonstrations which enable LLMs to explicitly generate reasoning steps and improve their reasoning task accuracy. To eliminate the manual effort, Zero-shot-CoT concatenates the target problem statement with "Let's think step by step" as an input prompt to LLMs. Despite the success of Zero-shot-CoT, it still suffers from three pitfalls: calculation errors, missing-step errors, and semantic misunderstanding errors. To address the missing-step errors, we propose Plan-and-Solve (PS) Prompting. It consists of two components: first, devising a plan to divide the entire task into smaller subtasks, and then carrying out the subtasks according to the plan. To address the calculation errors and improve the quality of generated reasoning steps, we extend PS prompting with more detailed instructions and derive PS+ prompting. We evaluate our proposed prompting strategy on ten datasets across three reasoning problems. The experimental results over GPT-3 show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin, is comparable to or exceeds Zero-shot-Program-of-Thought Prompting, and has comparable performance with 8-shot CoT prompting on the math reasoning problem. The code can be found at https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Plan-and-Solve (PS) prompting to improve zero-shot chain-of-thought reasoning in LLMs. PS first generates an explicit plan to decompose the task into subtasks, then solves the subtasks according to the plan. PS+ augments this with more detailed instructions to reduce calculation errors. On ten datasets spanning math, commonsense, and symbolic reasoning, PS prompting outperforms Zero-shot-CoT by a large margin, matches or exceeds Zero-shot-Program-of-Thought, and achieves performance comparable to 8-shot CoT on math tasks. Code is released for reproducibility.
Significance. If the reported gains are shown to arise specifically from the plan-then-solve decomposition rather than from uncontrolled prompt variations, the method supplies a lightweight, example-free technique that directly targets missing-step errors in zero-shot reasoning. The public code release supports reproducibility and further testing.
major comments (2)
- [Experimental Setup and Results sections] The experimental comparisons do not control for prompt length or instruction density. PS and PS+ templates are substantially longer and more detailed than the Zero-shot-CoT baseline (which uses only the problem plus 'Let's think step by step'). Without an ablation that holds total token count and instructional content constant while toggling only the explicit planning phase, the accuracy lifts cannot be attributed to the two-stage structure rather than surface-level prompt engineering differences.
- [Experimental Setup] No information is provided on whether decoding parameters (temperature, top-p, max tokens) or token budgets were matched across all prompting conditions. If these were not fixed, the observed differences could partly reflect generation-length or sampling effects rather than reasoning quality.
minor comments (1)
- [Abstract] The abstract names the three pitfalls of Zero-shot-CoT (calculation errors, missing-step errors, and semantic misunderstanding errors) but explains only how PS and PS+ address the first two; stating how the method mitigates semantic misunderstanding errors would improve immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on experimental controls. We address each major point below and outline revisions to strengthen the manuscript.
read point-by-point responses
- Referee: [Experimental Setup and Results sections] The experimental comparisons do not control for prompt length or instruction density. PS and PS+ templates are substantially longer and more detailed than the Zero-shot-CoT baseline (which uses only the problem plus 'Let's think step by step'). Without an ablation that holds total token count and instructional content constant while toggling only the explicit planning phase, the accuracy lifts cannot be attributed to the two-stage structure rather than surface-level prompt engineering differences.
  Authors: We agree that differences in prompt length and instructional density are a valid concern and that the current results do not fully isolate the contribution of the explicit plan-then-solve structure from other prompt-engineering factors. The added length in PS/PS+ arises from the instructions to generate a plan and solve subtasks, which directly target missing-step errors. To address this rigorously, we will add a new ablation in the revised manuscript that compares against a length-matched control prompt containing generic detailed instructions but omitting the explicit planning step. This will help attribute gains more precisely to the two-stage decomposition. revision: yes
- Referee: [Experimental Setup] No information is provided on whether decoding parameters (temperature, top-p, max tokens) or token budgets were matched across all prompting conditions. If these were not fixed, the observed differences could partly reflect generation-length or sampling effects rather than reasoning quality.
  Authors: All experiments used identical decoding parameters: temperature = 0 (for deterministic outputs), top_p = 1.0, and max_tokens = 512 (sufficient to avoid truncation on all datasets). These settings were applied uniformly. We will explicitly document these parameters in the Experimental Setup section of the revised manuscript. revision: yes
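For concreteness, a sketch of pinning those settings across every condition, again using the hypothetical `llm_complete` stand-in from the earlier sketches; the parameter values are the ones the authors state, while the call shape is an assumption.

```python
# Decoding parameters fixed across all prompting conditions, using the
# values stated above; `llm_complete` is the hypothetical stand-in client.

DECODING = dict(
    temperature=0.0,  # greedy decoding, deterministic outputs
    top_p=1.0,        # no nucleus truncation
    max_tokens=512,   # reported as sufficient to avoid truncation
)

def run_condition(trigger: str, questions: list[str]) -> list[str]:
    # Every condition (Zero-shot-CoT, PS, PS+, any control) shares DECODING,
    # so accuracy differences cannot stem from sampling or length effects.
    return [llm_complete(f"Q: {q}\nA: {trigger}", **DECODING) for q in questions]
```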
Circularity Check
No circularity: purely empirical prompting proposal with direct dataset comparisons
full rationale
The manuscript introduces Plan-and-Solve prompting as an engineering response to observed error modes in Zero-shot-CoT, then reports accuracy numbers on ten fixed datasets against published baselines. No equations, fitted parameters, uniqueness theorems, or self-citation chains are invoked to derive the method or its performance; the central claim is the measured accuracy lift itself, which is externally falsifiable on the same public benchmarks rather than guaranteed by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can generate and follow explicit multi-step plans when given appropriate zero-shot instructions.
Forward citations
Cited by 18 Pith papers
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
  Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
- RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
  RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
- Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation
  AWARE augments generative next-POI recommendation with LLM agents that produce user-anchored narratives capturing events, culture, and trends, delivering up to 12.4% relative gains on three real datasets.
- Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
  Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
- LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
  LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.
- Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
  LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and Ext...
- Measuring Faithfulness in Chain-of-Thought Reasoning
  Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
- LoopTrap: Termination Poisoning Attacks on LLM Agents
  LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
- Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
  HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
- PrismaDV: Automated Task-Aware Data Unit Test Generation
  PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt ...
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
  QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
- Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
  A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
  ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.
- From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
  AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
- Understanding the planning of LLM agents: A survey
  A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
  The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.