pith. machine review for the scientific record.

arxiv: 2305.04091 · v3 · submitted 2023-05-06 · 💻 cs.CL

Recognition: 2 theorem links

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 08:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords plan-and-solve prompting · zero-shot chain-of-thought · large language models · multi-step reasoning · prompt engineering · missing-step errors · GPT-3

The pith

Plan-and-solve prompting divides tasks into subtasks before solving them to cut missing-step errors in zero-shot chain-of-thought reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models still produce incomplete reasoning chains when simply told to think step by step. The paper introduces Plan-and-Solve prompting, which first requires the model to output an explicit plan that breaks the problem into ordered subtasks and then solves each subtask according to that plan. An extended version adds concrete instructions to also reduce calculation and semantic errors. Across ten datasets spanning arithmetic, commonsense, and symbolic reasoning, the zero-shot method outperforms plain zero-shot chain-of-thought by a large margin and reaches parity with eight-shot chain-of-thought on math problems.

Core claim

The authors establish that zero-shot prompting benefits from an explicit two-stage process: first generating a plan that decomposes the overall reasoning problem into ordered subtasks, then executing those subtasks sequentially according to the plan. This structure directly targets missing-step errors and, when augmented with instructions on calculation and semantic accuracy, yields higher final answer accuracy across multiple reasoning benchmarks.

What carries the argument

The Plan-and-Solve (PS) prompting template, which instructs the model to first output a plan and then solve according to it; the PS+ variant adds explicit guidance on avoiding calculation and misunderstanding errors.
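The two templates can be sketched as a minimal zero-shot pipeline. The trigger phrases below are the ones quoted in the paper's abstract and example outputs; `call_llm` is a hypothetical stand-in for whatever completion API is in use, and the second-pass answer-extraction prompt mirrors the convention used by Zero-shot-CoT.

```python
# PS trigger: plan first, then solve according to the plan.
PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the "
    "problem. Then, let's carry out the plan and solve the problem step by step."
)

# PS+ trigger: adds explicit guidance on variable extraction and calculation,
# targeting calculation and semantic-misunderstanding errors as well.
PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and "
    "their corresponding numerals, and devise a complete plan. Then, let's "
    "carry out the plan, calculate intermediate variables (pay attention to "
    "correct numerical calculation and commonsense), solve the problem step "
    "by step, and show the answer."
)

def plan_and_solve(question: str, call_llm, plus: bool = False) -> str:
    """Two-turn zero-shot pipeline: reasoning pass, then answer extraction."""
    trigger = PS_PLUS_TRIGGER if plus else PS_TRIGGER
    reasoning = call_llm(f"Q: {question}\nA: {trigger}")
    # Second call extracts the final answer from the generated reasoning.
    return call_llm(
        f"Q: {question}\nA: {trigger}\n{reasoning}\n"
        "Therefore, the answer (arabic numerals) is "
    )
```

Because both variants are pure prompt templates, swapping PS for PS+ (or for the plain Zero-shot-CoT trigger) changes nothing else in the pipeline, which is what makes the method cheap to adopt.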

If this is right

  • Zero-shot reasoning can approach the accuracy of few-shot chain-of-thought without any hand-crafted examples.
  • The method reduces missing-step errors across arithmetic, commonsense, and symbolic reasoning tasks.
  • Adding plan generation as an intermediate step raises final-answer accuracy on GPT-3 models by a consistent margin.
  • Performance remains competitive with program-of-thought baselines while staying simpler to implement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explicit decomposition step could be inserted into other zero-shot strategies such as tree-of-thoughts to organize branching exploration.
  • Models appear to hold latent planning ability that surface-level step-by-step prompts do not activate.
  • Testing whether plan creation and execution benefit from separate model calls could further improve results.
  • The technique may transfer to long-horizon domains such as code generation or multi-stage scientific reasoning.
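The separate-calls idea in the bullets above is straightforward to pilot: split plan generation and plan execution into two independent model calls instead of one combined pass. This is an editorial sketch, not the paper's method; `call_llm` and the prompt wording are hypothetical.

```python
def two_call_plan_and_solve(question: str, call_llm) -> str:
    """Generate a plan in one call, then execute it in a second call."""
    plan = call_llm(
        f"Q: {question}\n"
        "Devise a short numbered plan that divides this problem into "
        "subtasks. Do not solve it yet."
    )
    # The second call sees only the question and the finished plan, so plan
    # quality can be evaluated or edited independently of execution.
    return call_llm(
        f"Q: {question}\nPlan:\n{plan}\n"
        "Now carry out the plan step by step and state the final answer."
    )
```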

Load-bearing premise

The performance gains arise specifically from the plan-then-execute ordering rather than from prompt length or added instructions alone.

What would settle it

A control prompt that matches PS in length and instruction detail but removes the explicit plan-generation step and shows no accuracy improvement would falsify the central claim.
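A sketch of that control, under the assumption that word count is an adequate proxy for prompt length. The control wording is invented here for illustration; `evaluate` stands in for a hypothetical harness that runs a template over a benchmark and returns accuracy.

```python
# The actual PS trigger from the paper.
PS_PROMPT = (
    "Let's first understand the problem and devise a plan to solve the "
    "problem. Then, let's carry out the plan and solve the problem step by step."
)

# Hypothetical control: matched in word count and instruction density, but
# with no explicit plan-generation step.
CONTROL_PROMPT = (
    "Let's read the problem carefully and pay attention to every detail. "
    "Then, let's work through the problem slowly, checking each step before "
    "moving to the next."
)

def word_count(s: str) -> int:
    return len(s.split())

# The control is only informative if the prompts are comparable in length.
assert abs(word_count(PS_PROMPT) - word_count(CONTROL_PROMPT)) <= 3

def falsification_test(evaluate) -> bool:
    """True if the explicit plan step, not prompt length, drives the gain."""
    return evaluate(PS_PROMPT) > evaluate(CONTROL_PROMPT)
```

If `falsification_test` comes back False on the paper's benchmarks, the central attribution to the plan-then-execute structure would not survive.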

read the original abstract

Large language models (LLMs) have recently been shown to deliver impressive performance in various NLP tasks. To tackle multi-step reasoning tasks, few-shot chain-of-thought (CoT) prompting includes a few manually crafted step-by-step reasoning demonstrations which enable LLMs to explicitly generate reasoning steps and improve their reasoning task accuracy. To eliminate the manual effort, Zero-shot-CoT concatenates the target problem statement with "Let's think step by step" as an input prompt to LLMs. Despite the success of Zero-shot-CoT, it still suffers from three pitfalls: calculation errors, missing-step errors, and semantic misunderstanding errors. To address the missing-step errors, we propose Plan-and-Solve (PS) Prompting. It consists of two components: first, devising a plan to divide the entire task into smaller subtasks, and then carrying out the subtasks according to the plan. To address the calculation errors and improve the quality of generated reasoning steps, we extend PS prompting with more detailed instructions and derive PS+ prompting. We evaluate our proposed prompting strategy on ten datasets across three reasoning problems. The experimental results over GPT-3 show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin, is comparable to or exceeds Zero-shot-Program-of-Thought Prompting, and has comparable performance with 8-shot CoT prompting on the math reasoning problem. The code can be found at https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Plan-and-Solve (PS) prompting to improve zero-shot chain-of-thought reasoning in LLMs. PS first generates an explicit plan to decompose the task into subtasks, then solves the subtasks according to the plan. PS+ augments this with more detailed instructions to reduce calculation errors. On ten datasets spanning math, commonsense, and symbolic reasoning, PS prompting outperforms Zero-shot-CoT by a large margin, matches or exceeds Zero-shot-Program-of-Thought, and achieves performance comparable to 8-shot CoT on math tasks. Code is released for reproducibility.

Significance. If the reported gains are shown to arise specifically from the plan-then-solve decomposition rather than uncontrolled prompt variations, the method supplies a lightweight, example-free technique that directly targets missing-step errors in zero-shot reasoning. The public code release supports reproducibility and further testing.

major comments (2)
  1. [Experimental Setup and Results sections] The experimental comparisons do not control for prompt length or instruction density. PS and PS+ templates are substantially longer and more detailed than the Zero-shot-CoT baseline (which uses only the problem plus 'Let's think step by step'). Without an ablation that holds total token count and instructional content constant while toggling only the explicit planning phase, the accuracy lifts cannot be attributed to the two-stage structure rather than surface-level prompt engineering differences.
  2. [Experimental Setup] No information is provided on whether decoding parameters (temperature, top-p, max tokens) or token budgets were matched across all prompting conditions. If these were not fixed, the observed differences could partly reflect generation-length or sampling effects rather than reasoning quality.
minor comments (1)
  1. [Abstract] The abstract states that Zero-shot-CoT suffers from 'three pitfalls' but does not name them; listing calculation errors, missing-step errors, and semantic misunderstanding errors would improve immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental controls. We address each major point below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experimental Setup and Results sections] The experimental comparisons do not control for prompt length or instruction density. PS and PS+ templates are substantially longer and more detailed than the Zero-shot-CoT baseline (which uses only the problem plus 'Let's think step by step'). Without an ablation that holds total token count and instructional content constant while toggling only the explicit planning phase, the accuracy lifts cannot be attributed to the two-stage structure rather than surface-level prompt engineering differences.

    Authors: We agree that differences in prompt length and instructional density are a valid concern and that the current results do not fully isolate the contribution of the explicit plan-then-solve structure from other prompt-engineering factors. The added length in PS/PS+ arises from the instructions to generate a plan and solve subtasks, which directly target missing-step errors. To address this rigorously, we will add a new ablation in the revised manuscript that compares against a length-matched control prompt containing generic detailed instructions but omitting the explicit planning step. This will help attribute gains more precisely to the two-stage decomposition. revision: yes

  2. Referee: [Experimental Setup] No information is provided on whether decoding parameters (temperature, top-p, max tokens) or token budgets were matched across all prompting conditions. If these were not fixed, the observed differences could partly reflect generation-length or sampling effects rather than reasoning quality.

    Authors: All experiments used identical decoding parameters: temperature = 0 (for deterministic outputs), top_p = 1.0, and max_tokens = 512 (sufficient to avoid truncation on all datasets). These settings were applied uniformly. We will explicitly document these parameters in the Experimental Setup section of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical prompting proposal with direct dataset comparisons

full rationale

The manuscript introduces Plan-and-Solve prompting as an engineering response to observed error modes in Zero-shot-CoT, then reports accuracy numbers on ten fixed datasets against published baselines. No equations, fitted parameters, uniqueness theorems, or self-citation chains are invoked to derive the method or its performance; the central claim is the measured lift itself, which is externally falsifiable on the same public benchmarks and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method assumes LLMs can reliably generate and follow high-level plans when instructed; no new entities or fitted numerical parameters are introduced.

axioms (1)
  • domain assumption LLMs can generate and follow explicit multi-step plans when given appropriate zero-shot instructions
    Invoked to justify why the plan-then-solve format reduces missing-step errors

pith-pipeline@v0.9.0 · 5601 in / 1118 out tokens · 21726 ms · 2026-05-16T08:40:06.485739+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  2. RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.

  3. Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

    cs.AI 2026-05 unverdicted novelty 7.0

    AWARE augments generative next-POI recommendation with LLM agents that produce user-anchored narratives capturing events, culture, and trends, delivering up to 12.4% relative gains on three real datasets.

  4. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  5. LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

    cs.AI 2026-03 unverdicted novelty 7.0

    LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.

  6. Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

    cs.CL 2026-02 unverdicted novelty 7.0

    LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and Ext...

  7. Measuring Faithfulness in Chain-of-Thought Reasoning

    cs.AI 2023-07 conditional novelty 7.0

    Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.

  8. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  9. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  10. PrismaDV: Automated Task-Aware Data Unit Test Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt ...

  11. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  12. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  13. Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

    cs.CL 2026-04 conditional novelty 6.0

    A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

  14. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  15. Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.

  16. From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

  17. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  18. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    cs.AI 2025-01 unverdicted novelty 3.0

    The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 18 Pith papers · 2 internal anchors
