Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
27 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large language models (LLMs) have recently been shown to deliver impressive performance in various NLP tasks. To tackle multi-step reasoning tasks, few-shot chain-of-thought (CoT) prompting includes a few manually crafted step-by-step reasoning demonstrations which enable LLMs to explicitly generate reasoning steps and improve their reasoning task accuracy. To eliminate the manual effort, Zero-shot-CoT concatenates the target problem statement with "Let's think step by step" as an input prompt to LLMs. Despite the success of Zero-shot-CoT, it still suffers from three pitfalls: calculation errors, missing-step errors, and semantic misunderstanding errors. To address the missing-step errors, we propose Plan-and-Solve (PS) Prompting. It consists of two components: first, devising a plan to divide the entire task into smaller subtasks, and then carrying out the subtasks according to the plan. To address the calculation errors and improve the quality of generated reasoning steps, we extend PS prompting with more detailed instructions and derive PS+ prompting. We evaluate our proposed prompting strategy on ten datasets across three reasoning problems. The experimental results over GPT-3 show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin, is comparable to or exceeds Zero-shot-Program-of-Thought Prompting, and has comparable performance with 8-shot CoT prompting on the math reasoning problem. The code can be found at https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting.
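The prompt construction the abstract describes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the Zero-shot-CoT trigger is the sentence quoted in the abstract, while the Plan-and-Solve trigger below is a hypothetical wording that only paraphrases the two components named there (devise a plan, then carry it out). The `Q:`/`A:` framing and the `build_prompt` helper are also assumptions made for this sketch.

```python
# Trigger sentence quoted in the abstract for Zero-shot-CoT.
ZERO_SHOT_COT_TRIGGER = "Let's think step by step."

# Hypothetical wording: paraphrases the abstract's two components
# (plan, then execute). Not necessarily the paper's exact trigger.
PLAN_AND_SOLVE_TRIGGER = (
    "Let's first devise a plan that divides the task into smaller subtasks, "
    "then carry out the subtasks according to the plan."
)


def build_prompt(problem: str, trigger: str) -> str:
    """Concatenate a problem statement with a zero-shot trigger sentence."""
    # The "Q:/A:" framing is an assumption of this sketch.
    return f"Q: {problem.strip()}\nA: {trigger}"


if __name__ == "__main__":
    problem = "A store sells pens at 3 dollars each. How much do 4 pens cost?"
    print(build_prompt(problem, ZERO_SHOT_COT_TRIGGER))
    print()
    print(build_prompt(problem, PLAN_AND_SOLVE_TRIGGER))
```

Both prompts are zero-shot: neither includes worked demonstrations, and they differ only in the trigger sentence appended after the problem. For the exact PS and PS+ trigger texts, refer to the repository linked in the abstract.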
citation-role summary: background 4
citing papers explorer
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
-
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
-
Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation
AWARE augments generative next-POI recommendation with LLM agents that produce user-anchored narratives capturing events, culture, and trends, delivering up to 12.4% relative gains on three real datasets.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.
-
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
LLMs show heterogeneous robustness to five types of chain-of-thought perturbations: MathError causes 50-60% accuracy loss in small models but less as models scale, UnitConversion remains hard across sizes, and ExtraSteps causes minimal degradation.
-
Measuring Faithfulness in Chain-of-Thought Reasoning
Chain-of-Thought reasoning in LLMs is often unfaithful: how much a model relies on it varies by task, and larger models rely on it less.
-
LoopTrap: Termination Poisoning Attacks on LLM Agents
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
PrismaDV: Automated Task-Aware Data Unit Test Generation
PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt optimization that beats hand-written prompts.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models at a much smaller model size.
-
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
-
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
-
LitMOF: An LLM Multi-Agent for Literature-Validated Metal-Organic Frameworks Database Correction and Expansion
LitMOF repairs 8,771 erroneous MOF structures and adds 12,646 previously missing ones from literature, yielding a database of 186,773 computation-ready entries and demonstrating that uncorrected errors distort adsorption rankings.
-
QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild
QuiLL is a new evaluation pipeline that uses optimized LLM prompts, dynamic in-context learning from an NVD vector store, and a novel accuracy-plus-reasoning metric to benchmark vulnerability detection in real code.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
-
Runtime-Structured Task Decomposition for Agentic Coding Systems
Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.
-
Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.
-
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
-
Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection
Commander-GPT is a multi-agent routing framework that assigns sub-tasks in multimodal sarcasm detection to specialized LLMs coordinated by different commander models, reporting average F1 gains of 4.4% and 11.7% on MMSD and MMSD 2.0.
-
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.
-
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.