Decomposed Prompting: A Modular Approach for Solving Complex Tasks
Pith reviewed 2026-05-19 06:07 UTC · model grok-4.3
The pith
Decomposed Prompting lets LLMs solve complex tasks by splitting them into simpler sub-tasks handled by specialized prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decomposed Prompting decomposes a complex task into simpler sub-tasks via prompting and delegates each to a library of specialized prompting-based LLMs. The modular design lets every prompt be optimized for its sub-task, allows recursive decomposition of hard sub-tasks or long inputs, and supports replacement of any component with a stronger prompt, trained model, or symbolic routine. Experiments demonstrate that the method outperforms prior few-shot prompting with GPT-3 on symbolic reasoning tasks, long-context multi-hop QA, and open-domain multi-hop QA that incorporates symbolic retrieval.
What carries the argument
Decomposed Prompting is the mechanism: an initial decomposer prompt identifies the sub-tasks, and each sub-task is then assigned to a dedicated prompt-based solver that can be optimized on its own.
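The control flow described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HANDLERS`, `decompose`, and `run` and the letter-concatenation toy task are hypothetical, and plain Python functions stand in for the few-shot prompted LLM calls so that the sketch runs offline.

```python
from typing import Callable, Dict, List, Tuple

# Each handler stands in for one specialized few-shot prompt (assumption:
# plain functions replace real LLM calls so the sketch runs offline).
Handler = Callable[[str], str]


def split_words(text: str) -> str:
    """Sub-task 1: split the input into words, joined by '|'."""
    return "|".join(text.split())


def first_letters(words: str) -> str:
    """Sub-task 2: take the first letter of every word."""
    return "|".join(word[0] for word in words.split("|"))


def merge(letters: str) -> str:
    """Sub-task 3: concatenate the letters into one string."""
    return "".join(letters.split("|"))


# Hypothetical handler library, keyed by sub-task name.
HANDLERS: Dict[str, Handler] = {
    "split": split_words,
    "first_letter": first_letters,
    "merge": merge,
}


def decompose(task_input: str) -> List[str]:
    """Stand-in for the decomposer prompt: returns an ordered plan of handler names."""
    return ["split", "first_letter", "merge"]


def run(
    task_input: str, handlers: Dict[str, Handler] = HANDLERS
) -> Tuple[str, List[Tuple[str, str]]]:
    """Execute the plan, feeding each handler the previous handler's output."""
    value = task_input
    trace: List[Tuple[str, str]] = []
    for name in decompose(task_input):
        value = handlers[name](value)
        trace.append((name, value))
    return value, trace


answer, trace = run("Jack Ryan")
print(answer)  # JR
```

Because `run` only looks handlers up by name, any entry in the registry can be replaced (by a better prompt, a trained model, or a symbolic function) without touching the rest of the pipeline.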
If this is right
- On symbolic reasoning, sub-tasks that remain hard for LLMs can themselves be broken into even simpler solvable pieces.
- When complexity stems from input length, the same task can be applied recursively to smaller input segments.
- Long-context multi-hop QA improves when each reasoning sub-task receives its own focused prompt rather than a single combined prompt.
- Open-domain multi-hop QA gains when a symbolic information-retrieval step is inserted as one module inside the decomposition.
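The recursive case for long inputs can be illustrated with list reversal. The sketch below is an assumption-laden toy: `MAX_BASE_LEN`, `reverse_short`, and `reverse_decomposed` are hypothetical names, and `reverse_short` is a plain function standing in for a prompt that is only trusted on short inputs.

```python
from typing import List

MAX_BASE_LEN = 3  # hypothetical length the base prompt handles reliably


def reverse_short(items: List[str]) -> List[str]:
    """Stand-in for a few-shot prompt that reverses short lists only."""
    if len(items) > MAX_BASE_LEN:
        raise ValueError("base solver called on an input that is too long")
    return items[::-1]


def reverse_decomposed(items: List[str]) -> List[str]:
    """Recursively split long inputs until the base solver applies."""
    if len(items) <= MAX_BASE_LEN:
        return reverse_short(items)
    mid = len(items) // 2
    left, right = items[:mid], items[mid:]
    # Reversing a concatenation swaps the order of the reversed halves.
    return reverse_decomposed(right) + reverse_decomposed(left)


words = "a b c d e f g h".split()
print(reverse_decomposed(words))
```

The base solver never sees more than `MAX_BASE_LEN` items, which is the point of the recursion: the hard long-input task is reduced to the same task on inputs the prompt can already handle.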
Where Pith is reading between the lines
- The same decomposition structure could let developers swap in future specialized models for individual sub-tasks without retraining the entire pipeline.
- Explicit sub-task boundaries make it easier to diagnose which part of a complex problem an LLM is failing on.
- Hybrid systems that combine neural prompts with classical symbolic algorithms become simpler to assemble once each sub-task has its own interface.
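A small sketch of the swap-and-diagnose idea follows. Everything in it is hypothetical: the three-sentence `CORPUS`, the word-overlap retriever standing in for a symbolic retrieval module, and the trivial `extract_answer` function standing in for a reading-comprehension prompt. The per-step trace is what makes a failing module visible.

```python
from typing import Callable, Dict, List, Tuple

Module = Callable[[str], str]

# Hypothetical in-memory corpus, used only for this illustration.
CORPUS = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
]


def overlap_retrieve(query: str) -> str:
    """Symbolic module: return the sentence sharing the most words with the query."""
    terms = set(query.lower().replace("?", "").split())

    def score(doc: str) -> int:
        return len(terms & set(doc.lower().rstrip(".").split()))

    return max(CORPUS, key=score)


def null_retrieve(query: str) -> str:
    """Deliberately broken module, used to show how the trace localizes a failure."""
    return ""


def extract_answer(passage: str) -> str:
    """Stand-in for a reader prompt: returns the first word of the passage."""
    return passage.split()[0] if passage else "unknown"


def run_pipeline(
    query: str, modules: Dict[str, Module]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run retrieve -> answer and record each module's output."""
    value = query
    trace: List[Tuple[str, str]] = []
    for name in ("retrieve", "answer"):
        value = modules[name](value)
        trace.append((name, value))
    return value, trace


question = "What is the capital of Germany?"
good, good_trace = run_pipeline(
    question, {"retrieve": overlap_retrieve, "answer": extract_answer}
)
bad, bad_trace = run_pipeline(
    question, {"retrieve": null_retrieve, "answer": extract_answer}
)
print(good, good_trace)
print(bad, bad_trace)
```

In the second run the trace shows an empty string coming out of `retrieve`, so the wrong final answer can be attributed to that module rather than to the reader.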
Load-bearing premise
Sub-tasks identified by prompting can be solved independently without losing critical context or interdependencies that exist in the original task.
What would settle it
A controlled comparison on a task with strong cross-sub-task dependencies, using identical GPT-3 back-ends for both a single direct prompt and the decomposed version, that shows no accuracy gain for the decomposed approach.
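One way to organize such a comparison is a harness that scores both strategies on the same examples and reports paired outcomes. The sketch below is hypothetical throughout: `accuracy` and `paired_outcomes` are illustrative helpers, and the two toy solvers are invented only to exercise the harness. They say nothing about how GPT-3 actually behaves.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]      # (input text, gold answer)
Solver = Callable[[str], str]  # wraps either prompting strategy


def accuracy(solver: Solver, data: List[Example]) -> float:
    """Exact-match accuracy of one solver on the shared evaluation set."""
    return sum(solver(x) == gold for x, gold in data) / len(data)


def paired_outcomes(
    direct: Solver, decomposed: Solver, data: List[Example]
) -> Dict[str, int]:
    """Count examples by which of the two strategies answers correctly."""
    counts = {"both": 0, "direct_only": 0, "decomposed_only": 0, "neither": 0}
    for x, gold in data:
        direct_ok = direct(x) == gold
        decomposed_ok = decomposed(x) == gold
        if direct_ok and decomposed_ok:
            counts["both"] += 1
        elif direct_ok:
            counts["direct_only"] += 1
        elif decomposed_ok:
            counts["decomposed_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy stand-ins (assumption): both would wrap the same LLM back-end in practice.
def toy_decomposed(text: str) -> str:
    return "".join(word[0] for word in text.split())


def toy_direct(text: str) -> str:
    # Invented failure mode: silently ignores every word past the third.
    return "".join(word[0] for word in text.split()[:3])


DATA: List[Example] = [
    ("a b", "ab"),
    ("c d e", "cde"),
    ("f g h i", "fghi"),
    ("j k l m n", "jklmn"),
]

gain = accuracy(toy_decomposed, DATA) - accuracy(toy_direct, DATA)
print(gain, paired_outcomes(toy_direct, toy_decomposed, DATA))
```

A gain near zero on a task with strong cross-sub-task dependencies, with the back-end held fixed, is the outcome the paragraph above describes as settling the question.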
Original abstract
Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on long-context multi-hop QA task, we can more effectively teach the sub-tasks via our separate sub-tasks prompts; and on open-domain multi-hop QA, we can incorporate a symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks. Datasets, Code and Prompts available at https://github.com/allenai/DecomP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Decomposed Prompting, a modular framework that uses prompting to break complex tasks into simpler sub-tasks, each solved by a dedicated LLM prompt (or further decomposed, or replaced by symbolic functions). It reports empirical gains over standard few-shot GPT-3 prompting on symbolic reasoning (via recursive sub-task decomposition), long-context multi-hop QA (via separate sub-task teaching), and open-domain multi-hop QA (via integration of symbolic retrieval).
Significance. If the results hold under fuller controls, the approach offers a practical route to scale few-shot prompting to harder reasoning problems by enabling per-sub-task optimization and hybrid symbolic-neural pipelines. The recursive decomposition for input-length issues and the explicit modularity for swapping components are concrete strengths that could influence subsequent work on compositional LLM use.
Major comments (2)
- [§4] §4 (multi-hop QA experiments): the central claim that separate sub-task prompts improve performance rests on the assumption that sub-answers can be produced independently without loss of inter-task dependencies or original-query constraints. The manuscript provides no ablation that inserts explicit state variables or chained context between sub-prompts, so it remains unclear whether observed gains are due to modularity or simply to more careful prompt engineering.
- [§3.2] §3.2 and Table 2 (symbolic reasoning results): the paper states that further decomposition of hard sub-tasks yields solvable units, yet no controlled comparison is shown between (a) the full decomposed pipeline and (b) a single prompt that receives the same total number of in-context examples but without explicit decomposition. This leaves open whether the reported accuracy lift is attributable to the modular structure or to the increased total supervision.
Minor comments (2)
- [Appendix / reproducibility] The GitHub repository is cited for code and prompts; the manuscript should include a short appendix table mapping each reported experiment to the exact prompt files used.
- [Figure 1] Figure 1 (decomposition diagram): arrows between sub-task modules are not labeled with the exact information passed (e.g., whether the original query or prior sub-answers are included), which reduces clarity for readers trying to replicate the flow.
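The question of what information flows between modules can be made concrete with a prompt builder that has an explicit switch. This is a hypothetical sketch: `build_sub_prompt`, its parameters, and the prompt wording are assumptions for illustration and are not taken from the paper's prompt files.

```python
from typing import List, Tuple


def build_sub_prompt(
    sub_question: str,
    original_query: str,
    history: List[Tuple[str, str]],
    chained: bool,
) -> str:
    """Build the text sent to one sub-task solver.

    chained=False: the solver sees only its own sub-question.
    chained=True:  the original query and earlier sub-answers are prepended.
    """
    lines: List[str] = []
    if chained:
        lines.append(f"Original question: {original_query}")
        for earlier_q, earlier_a in history:
            lines.append(f"Earlier step: {earlier_q} -> {earlier_a}")
    lines.append(f"Q: {sub_question}")
    lines.append("A:")
    return "\n".join(lines)


query = "Which river flows through the capital of France?"
history = [("What is the capital of France?", "Paris")]

independent = build_sub_prompt("Which river flows through Paris?", query, history, chained=False)
with_context = build_sub_prompt("Which river flows through Paris?", query, history, chained=True)
print(independent)
print(with_context)
```

Running the same pipeline with `chained` on and off is one way to set up the ablation requested in the first major comment, and it also documents exactly what each arrow in a flow diagram would carry.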
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our design choices and proposed revisions to strengthen the empirical support for the claims.
Point-by-point responses
- Referee: [§4] §4 (multi-hop QA experiments): the central claim that separate sub-task prompts improve performance rests on the assumption that sub-answers can be produced independently without loss of inter-task dependencies or original-query constraints. The manuscript provides no ablation that inserts explicit state variables or chained context between sub-prompts, so it remains unclear whether observed gains are due to modularity or simply to more careful prompt engineering.
  Authors: We agree that an explicit ablation would help isolate the contribution of modularity. Our decomposition is intentionally designed so that each sub-task prompt receives a self-contained formulation derived from the original query, reducing the need for persistent state across steps. In the revised manuscript we will add an ablation that inserts explicit state variables or chained context between sub-prompts and reports the resulting performance, allowing direct comparison to the independent-subtask setting used in the paper.
  Revision: yes
- Referee: [§3.2] §3.2 and Table 2 (symbolic reasoning results): the paper states that further decomposition of hard sub-tasks yields solvable units, yet no controlled comparison is shown between (a) the full decomposed pipeline and (b) a single prompt that receives the same total number of in-context examples but without explicit decomposition. This leaves open whether the reported accuracy lift is attributable to the modular structure or to the increased total supervision.
  Authors: This is a fair point about isolating the effect of structure versus supervision volume. The core benefit we emphasize is that decomposition permits both distribution of examples across specialized prompts and recursive breakdown of otherwise intractable sub-tasks. We will add to the revision a controlled comparison in which a single non-decomposed prompt is given the identical total number of in-context examples (aggregated from the sub-task prompts) and show that it underperforms the modular pipeline, thereby demonstrating that the explicit decomposition contributes beyond raw example count.
  Revision: yes
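The budget-matched baseline discussed in this exchange can be set up mechanically. The sketch below is one possible reading, under assumptions: the example strings, `SUB_TASK_EXAMPLES`, `END_TO_END_POOL`, and the helper names are all hypothetical, and the baseline draws end-to-end examples equal in number to the sub-task examples rather than reusing the sub-task examples verbatim.

```python
from typing import Dict, List

# Hypothetical sub-task prompts: each value lists that prompt's in-context examples.
SUB_TASK_EXAMPLES: Dict[str, List[str]] = {
    "split": ["'Jack Ryan' -> Jack | Ryan", "'Ada Lovelace' -> Ada | Lovelace"],
    "first_letter": ["Jack -> J", "Ryan -> R", "Ada -> A"],
    "merge": ["J | R -> JR"],
}

# Hypothetical pool of end-to-end examples for the single, non-decomposed prompt.
END_TO_END_POOL: List[str] = [
    "'Jack Ryan' -> JR",
    "'Ada Lovelace' -> AL",
    "'Alan Turing' -> AT",
    "'Grace Hopper' -> GH",
    "'Kurt Godel' -> KG",
    "'Emmy Noether' -> EN",
    "'John McCarthy' -> JM",
]


def example_budget(sub_task_examples: Dict[str, List[str]]) -> int:
    """Total number of in-context examples used across all sub-task prompts."""
    return sum(len(examples) for examples in sub_task_examples.values())


def matched_baseline(pool: List[str], budget: int) -> List[str]:
    """Pick exactly `budget` end-to-end examples for the single-prompt baseline."""
    if len(pool) < budget:
        raise ValueError("pool is too small to match the decomposed budget")
    return pool[:budget]


def render_prompt(examples: List[str], query: str) -> str:
    """Join the examples and append the test query."""
    return "\n".join(examples + [f"'{query}' ->"])


budget = example_budget(SUB_TASK_EXAMPLES)
baseline_examples = matched_baseline(END_TO_END_POOL, budget)
print(budget)
print(render_prompt(baseline_examples, "Barbara Liskov"))
```

Holding `budget` fixed across the two conditions removes example count as an explanation for any remaining accuracy difference.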
Circularity Check
No circularity: empirical method validated on benchmarks
Full rationale
The paper introduces Decomposed Prompting as a modular prompting technique for complex tasks, evaluated empirically on symbolic reasoning, long-context multi-hop QA, and open-domain QA using GPT-3. Claims of outperformance rest on experimental results rather than any derivation, equation, or prediction that reduces to its own inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The sub-task decomposition is a design choice tested via benchmarks, not a tautological redefinition of inputs.
Forward citations
Cited by 20 Pith papers
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
  DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
- DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
  DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
- From Gaze to Guidance: Interpreting and Adapting to Users' Cognitive Needs with Multimodal Gaze-Aware AI Assistants
  A gaze-aware LLM assistant using egocentric video with gaze overlays outperforms text-only LLMs in accuracy of reading behavior assessment, personalization, information recall, and interaction efficiency in a 36-person study.
- Training Large Language Models to Reason in a Continuous Latent Space
  Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...
- GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
  GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...
- STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
  STaD generates controlled scaffolded variations of reasoning benchmarks to identify unique compositional skill gaps across different LLMs.
- Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
  TRACE prompting induces MLLMs to produce textual allocentric 3D representations from video, yielding consistent gains on spatial QA benchmarks across multiple model backbones.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
  VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
- QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild
  QuiLL is a new evaluation pipeline that uses optimized LLM prompts, dynamic in-context learning from an NVD vector store, and a novel accuracy-plus-reasoning metric to benchmark vulnerability detection in real code.
- Towards Explorative IRBL: Combining Semantic Retrieval with LLM-driven Iterative Code Exploration
  GenLoc integrates semantic retrieval and LLM-based iterative code exploration to outperform prior IRBL and LLM methods on Java and Python bug localization benchmarks.
- How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
  Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
  FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.
- Teaching Large Language Models to Self-Debug
  Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
- ART: Automatic multi-step reasoning and tool-use for large language models
  ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
- Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
  Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
- NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
  Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
- OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting
  OOPrompt reifies user intents into structured manipulable artifacts to enable modular and iterative prompting in LLM-based interactive systems.
- Fine-grained Multi-Document Extraction and Generation of Code Change Rationale
  ARGUS extracts fragmented code change rationales from multiple documents using LLMs and generates summaries that developers rate as useful for review and maintenance.
- Large Language Model-Brained GUI Agents: A Survey
  A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
- From System 1 to System 2: A Survey of Reasoning Large Language Models
  The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.