Decomposed Prompting: A Modular Approach for Solving Complex Tasks

Ashish Sabharwal; Harsh Trivedi; Kyle Richardson; Matthew Finlayson; Peter Clark; Tushar Khot; Yao Fu

arXiv: 2210.02406 · v2 · pith:YRDAXCUM · submitted 2022-10-05 · cs.CL


Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal

Reviewed by Pith · 2026-05-19 06:07 UTC · grok-4.3 · pith:YRDAXCUM

classification: cs.CL

keywords: decomposed prompting, few-shot prompting, large language models, task decomposition, multi-hop question answering, symbolic reasoning, modular methods


The pith

Decomposed Prompting lets LLMs solve complex tasks by splitting them into simpler sub-tasks handled by specialized prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard few-shot prompting with large language models loses effectiveness once tasks grow complex or require many hard reasoning steps. Decomposed Prompting counters this by first using a prompt to break the original task into independent sub-tasks, then routing each sub-task to its own dedicated prompt-based solver. Each solver can be tuned separately, further decomposed when needed, or swapped for a symbolic function or a different model. This structure produces higher accuracy than direct few-shot prompting on symbolic reasoning problems, long-context multi-hop QA, and open-domain QA that mixes retrieval with reasoning.

Core claim

Decomposed Prompting decomposes a complex task into simpler sub-tasks via prompting and delegates each to a library of specialized prompting-based LLMs. The modular design lets every prompt be optimized for its sub-task, allows recursive decomposition of hard sub-tasks or long inputs, and supports replacement of any component with a stronger prompt, trained model, or symbolic routine. Experiments demonstrate that the method outperforms prior few-shot prompting with GPT-3 on symbolic reasoning tasks, long-context multi-hop QA, and open-domain multi-hop QA that incorporates symbolic retrieval.
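The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HANDLERS`, `decompose`, and `run` are hypothetical, the decomposer is hard-coded where the paper would use a few-shot prompt, and each handler is a plain Python function standing in for a prompted LLM. The task is a toy letter-concatenation problem.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sub-task handlers. In the approach described above each of
# these would be a dedicated few-shot prompt; plain functions stand in here.
def split_words(text: str) -> List[str]:
    return text.split()

def first_letter(words: List[str]) -> List[str]:
    return [w[0] for w in words]

def join_letters(letters: List[str]) -> str:
    return " ".join(letters)

# Registry mapping sub-task names to their solvers (assumed structure).
HANDLERS: Dict[str, Callable] = {
    "split": split_words,
    "first_letter": first_letter,
    "join": join_letters,
}

def decompose(task: str) -> List[str]:
    """Stand-in for the decomposer prompt: return an ordered list of sub-task names."""
    if task == "concat_first_letters":
        return ["split", "first_letter", "join"]
    raise ValueError(f"no decomposition known for task {task!r}")

def run(task: str, task_input: str) -> Tuple[object, List[Tuple[str, object]]]:
    """Execute each sub-task in order, feeding each output to the next handler."""
    value: object = task_input
    trace: List[Tuple[str, object]] = []
    for name in decompose(task):
        value = HANDLERS[name](value)
        trace.append((name, value))
    return value, trace

answer, trace = run("concat_first_letters", "Decomposed Prompting Works")
print(answer)  # D P W
```

Because every sub-task is looked up by name, replacing one entry in `HANDLERS` changes that step without touching the decomposer or the other handlers, which is the modularity property the core claim relies on. The `trace` list records each intermediate result, so a failing step can be located directly.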

What carries the argument

Decomposed Prompting, the mechanism that uses an initial prompt to identify sub-tasks and then assigns each to a dedicated, optimizable prompt-based solver.

If this is right

On symbolic reasoning, sub-tasks that remain hard for LLMs can themselves be broken into even simpler solvable pieces.
When complexity stems from input length, the same task can be applied recursively to smaller input segments.
Long-context multi-hop QA improves when each reasoning sub-task receives its own focused prompt rather than a single combined prompt.
Open-domain multi-hop QA gains when a symbolic information-retrieval step is inserted as one module inside the decomposition.
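The recursive case in the second point can be sketched with a toy list-reversal task. The sketch assumes a base solver that is only reliable on short inputs; `MAX_DIRECT`, `base_reverse`, and `recursive_reverse` are illustrative names, and the base solver is ordinary Python rather than an LLM prompt.

```python
from typing import List

MAX_DIRECT = 3  # assumed longest input the base solver handles reliably

def base_reverse(items: List[str]) -> List[str]:
    """Stand-in for a prompt that reverses short lists only."""
    if len(items) > MAX_DIRECT:
        raise ValueError("input too long for the base solver")
    return items[::-1]

def recursive_reverse(items: List[str]) -> List[str]:
    """Split a long input in half, solve each half, then merge.

    Reversing a list is the reversed right half followed by the
    reversed left half, so the same task recurs on smaller inputs.
    """
    if len(items) <= MAX_DIRECT:
        return base_reverse(items)
    mid = len(items) // 2
    left, right = items[:mid], items[mid:]
    return recursive_reverse(right) + recursive_reverse(left)

words = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(recursive_reverse(words))  # ['h', 'g', 'f', 'e', 'd', 'c', 'b', 'a']
```

The base solver never sees more than `MAX_DIRECT` items, so its reliability on short inputs is all the pipeline depends on. The merge step here is symbolic, and in a prompted system it could equally be another sub-task handler.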

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition structure could let developers swap in future specialized models for individual sub-tasks without retraining the entire pipeline.
Explicit sub-task boundaries make it easier to diagnose which part of a complex problem an LLM is failing on.
Hybrid systems that combine neural prompts with classical symbolic algorithms become simpler to assemble once each sub-task has its own interface.
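The interface point can be made concrete with a small sketch in which a symbolic retriever and a reader share one text-in, text-out signature. Everything here is assumed for illustration: the `Module` alias, the three-sentence `CORPUS`, the word-overlap retriever, and the `stub_reader` that stands in for an LLM reading prompt.

```python
from typing import Callable, Dict, List

Module = Callable[[str], str]  # shared interface: text in, text out

CORPUS: List[str] = [  # toy documents, invented for illustration
    "Ada Lovelace was born in London.",
    "London is the capital of the United Kingdom.",
    "Paris is the capital of France.",
]

def keyword_retrieve(query: str) -> str:
    """Symbolic retriever: return the document sharing the most words with the query."""
    q = set(query.lower().replace("?", "").split())

    def overlap(doc: str) -> int:
        return len(q & set(doc.lower().replace(".", "").split()))

    return max(CORPUS, key=overlap)

def stub_reader(passage: str) -> str:
    """Stand-in for an LLM reading prompt: return the last word of the passage."""
    return passage.rstrip(".").split()[-1]

def answer(question: str, modules: Dict[str, Module]) -> str:
    """Retrieve a passage, then read the answer out of it."""
    passage = modules["retrieve"](question)
    return modules["read"](passage)

modules: Dict[str, Module] = {"retrieve": keyword_retrieve, "read": stub_reader}
print(answer("Where was Ada Lovelace born?", modules))  # London
```

Since both modules satisfy the same signature, either one can be swapped for a different retriever or a different model by changing a single dictionary entry, without retraining or rewriting the rest of the pipeline.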

Load-bearing premise

Sub-tasks identified by prompting can be solved independently without losing critical context or interdependencies that exist in the original task.

What would settle it

A controlled comparison on a task with strong cross-sub-task dependencies, using identical GPT-3 back-ends for both a single direct prompt and the decomposed version, that shows no accuracy gain for the decomposed approach.
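One way to organize such a comparison is a paired evaluation in which both systems are scored on identical examples. The sketch below is an assumption about how that harness might look, not a procedure from the paper: `paired_comparison` and the two toy systems are hypothetical, and in a real run both callables would wrap the same LLM back-end.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input, gold answer)

def paired_comparison(
    direct: Callable[[str], str],
    decomposed: Callable[[str], str],
    examples: List[Example],
) -> Dict[str, float]:
    """Score two systems on the same examples and count disagreements."""
    direct_ok = [direct(x) == y for x, y in examples]
    decomp_ok = [decomposed(x) == y for x, y in examples]
    n = len(examples)
    pairs = list(zip(direct_ok, decomp_ok))
    return {
        "direct_acc": sum(direct_ok) / n,
        "decomposed_acc": sum(decomp_ok) / n,
        "only_direct_correct": sum(a and not b for a, b in pairs),
        "only_decomposed_correct": sum(b and not a for a, b in pairs),
    }

# Toy stand-ins, invented for illustration only.
EXAMPLES: List[Example] = [("ab", "AB"), ("cd", "CD"), ("ef", "EF"), ("gh", "GH")]

def direct_system(x: str) -> str:
    return "??" if x == "gh" else x.upper()

def decomposed_system(x: str) -> str:
    return "??" if x == "ab" else x.upper()

report = paired_comparison(direct_system, decomposed_system, EXAMPLES)
print(report)
```

In this toy run the two accuracies are equal and the disagreements are balanced, which is the pattern that, on a real task with strong cross-sub-task dependencies, would count against the decomposed approach. The disagreement counts matter because equal accuracy can hide the two systems failing on different examples.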

Original abstract

Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on long-context multi-hop QA task, we can more effectively teach the sub-tasks via our separate sub-tasks prompts; and on open-domain multi-hop QA, we can incorporate a symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks. Datasets, Code and Prompts available at https://github.com/allenai/DecomP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Decomposed Prompting adds a modular decomposition layer on top of few-shot prompting that helps on some symbolic and multi-hop tasks, but the gains may trace more to prompt tuning than to true independence of sub-tasks.


Decomposed Prompting breaks complex tasks into simpler sub-tasks via prompting, then hands each one to a dedicated prompt or symbolic module. The central claim is that this structure beats plain few-shot prompting with GPT-3 on the symbolic reasoning and multi-hop QA tasks the authors report, and that one can recurse on long inputs or swap in retrieval functions when needed. Code and prompts are on GitHub, which is helpful for anyone who wants to try the method directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes Decomposed Prompting, a modular framework that uses prompting to break complex tasks into simpler sub-tasks, each solved by a dedicated LLM prompt (or further decomposed, or replaced by symbolic functions). It reports empirical gains over standard few-shot GPT-3 prompting on symbolic reasoning (via recursive sub-task decomposition), long-context multi-hop QA (via separate sub-task teaching), and open-domain multi-hop QA (via integration of symbolic retrieval).

Significance. If the results hold under fuller controls, the approach offers a practical route to scale few-shot prompting to harder reasoning problems by enabling per-sub-task optimization and hybrid symbolic-neural pipelines. The recursive decomposition for input-length issues and the explicit modularity for swapping components are concrete strengths that could influence subsequent work on compositional LLM use.

major comments (2)

[§4] §4 (multi-hop QA experiments): the central claim that separate sub-task prompts improve performance rests on the assumption that sub-answers can be produced independently without loss of inter-task dependencies or original-query constraints. The manuscript provides no ablation that inserts explicit state variables or chained context between sub-prompts, so it remains unclear whether observed gains are due to modularity or simply to more careful prompt engineering.
[§3.2] §3.2 and Table 2 (symbolic reasoning results): the paper states that further decomposition of hard sub-tasks yields solvable units, yet no controlled comparison is shown between (a) the full decomposed pipeline and (b) a single prompt that receives the same total number of in-context examples but without explicit decomposition. This leaves open whether the reported accuracy lift is attributable to the modular structure or to the increased total supervision.
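The ablation requested in the first comment amounts to varying what each sub-task prompt is allowed to see. A minimal sketch of such a prompt builder follows. The function name, the flags `include_query` and `include_history`, and the prompt layout are all assumptions made for illustration and do not reflect the prompt format in the authors' repository.

```python
from typing import List, Tuple

def build_subprompt(
    sub_question: str,
    original_query: str,
    history: List[Tuple[str, str]],
    include_query: bool,
    include_history: bool,
) -> str:
    """Assemble the text sent to a sub-task solver under one ablation setting."""
    parts: List[str] = []
    if include_query:
        # Expose the original question so its constraints are not lost.
        parts.append(f"Original question: {original_query}")
    if include_history:
        # Chain earlier sub-questions and their answers as explicit state.
        for q, a in history:
            parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {sub_question}\nA:")
    return "\n".join(parts)

history = [("Who wrote X?", "Alice")]
independent = build_subprompt(
    "Where was Alice born?", "Where was the author of X born?", history, False, False
)
chained = build_subprompt(
    "Where was Alice born?", "Where was the author of X born?", history, True, True
)
print(independent)
print("---")
print(chained)
```

Running the same sub-task solver on both variants, with everything else held fixed, would separate the effect of independence from the effect of prompt wording.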

minor comments (2)

[Appendix / reproducibility] The GitHub repository is cited for code and prompts; the manuscript should include a short appendix table mapping each reported experiment to the exact prompt files used.
[Figure 1] Figure 1 (decomposition diagram): arrows between sub-task modules are not labeled with the exact information passed (e.g., whether the original query or prior sub-answers are included), which reduces clarity for readers trying to replicate the flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our design choices and proposed revisions to strengthen the empirical support for the claims.


Referee: [§4] §4 (multi-hop QA experiments): the central claim that separate sub-task prompts improve performance rests on the assumption that sub-answers can be produced independently without loss of inter-task dependencies or original-query constraints. The manuscript provides no ablation that inserts explicit state variables or chained context between sub-prompts, so it remains unclear whether observed gains are due to modularity or simply to more careful prompt engineering.

Authors: We agree that an explicit ablation would help isolate the contribution of modularity. Our decomposition is intentionally designed so that each sub-task prompt receives a self-contained formulation derived from the original query, reducing the need for persistent state across steps. In the revised manuscript we will add an ablation that inserts explicit state variables or chained context between sub-prompts and reports the resulting performance, allowing direct comparison to the independent-subtask setting used in the paper. Revision: yes.

Referee: [§3.2] §3.2 and Table 2 (symbolic reasoning results): the paper states that further decomposition of hard sub-tasks yields solvable units, yet no controlled comparison is shown between (a) the full decomposed pipeline and (b) a single prompt that receives the same total number of in-context examples but without explicit decomposition. This leaves open whether the reported accuracy lift is attributable to the modular structure or to the increased total supervision.

Authors: This is a fair point about isolating the effect of structure versus supervision volume. The core benefit we emphasize is that decomposition permits both distribution of examples across specialized prompts and recursive breakdown of otherwise intractable sub-tasks. We will add to the revision a controlled comparison in which a single non-decomposed prompt is given the identical total number of in-context examples (aggregated from the sub-task prompts) and show that it underperforms the modular pipeline, thereby demonstrating that the explicit decomposition contributes beyond raw example count. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical method validated on benchmarks


The paper introduces Decomposed Prompting as a modular prompting technique for complex tasks, evaluated empirically on symbolic reasoning, long-context multi-hop QA, and open-domain QA using GPT-3. Claims of outperformance rest on experimental results rather than any derivation, equation, or prediction that reduces to its own inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The sub-task decomposition is a design choice tested via benchmarks, not a tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the approach relies on empirical prompting techniques and standard LLM capabilities.



Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Transformers Provably Learn to Internalize Chain-of-Thought
cs.LG 2026-05 unverdicted novelty 8.0
L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
cs.AI 2026-05 unverdicted novelty 7.0
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...

From Gaze to Guidance: Interpreting and Adapting to Users' Cognitive Needs with Multimodal Gaze-Aware AI Assistants
cs.HC 2026-04 unverdicted novelty 7.0
A gaze-aware LLM assistant using egocentric video with gaze overlays outperforms text-only LLMs in accuracy of reading behavior assessment, personalization, information recall, and interaction efficiency in a 36-person study.

Training Large Language Models to Reason in a Continuous Latent Space
cs.CL 2024-12 unverdicted novelty 7.0
Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis
cs.LG 2024-10 unverdicted novelty 7.0
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.

Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting
cs.SE 2026-06 unverdicted novelty 6.0
Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.

The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions
cs.CL 2026-06 unverdicted novelty 6.0
Comparative evaluation of seven confidence constructions across 25 LLM-dataset pairs reveals that verbalized scores provide good ranking but coarse granularity for thresholding, while multi-query aggregation helps wea...

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
cs.CL 2026-05 unverdicted novelty 6.0
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...

STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
cs.CL 2026-04 unverdicted novelty 6.0
STaD generates controlled scaffolded variations of reasoning benchmarks to identify unique compositional skill gaps across different LLMs.

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
cs.CV 2026-03 unverdicted novelty 6.0
TRACE prompting induces MLLMs to produce textual allocentric 3D representations from video, yielding consistent gains on spatial QA benchmarks across multiple model backbones.

Vision-aligned Latent Reasoning for Multi-modal Large Language Model
cs.CV 2026-02 unverdicted novelty 6.0
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild
cs.CR 2025-10 unverdicted novelty 6.0
QuiLL is a new evaluation pipeline that uses optimized LLM prompts, dynamic in-context learning from an NVD vector store, and a novel accuracy-plus-reasoning metric to benchmark vulnerability detection in real code.

Towards Explorative IRBL: Combining Semantic Retrieval with LLM-driven Iterative Code Exploration
cs.SE 2025-08 unverdicted novelty 6.0
GenLoc integrates semantic retrieval and LLM-based iterative code exploration to outperform prior IRBL and LLM methods on Java and Python bug localization benchmarks.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
cs.CV 2025-07 unverdicted novelty 6.0
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.

AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges
cs.SE 2025-03 unverdicted novelty 6.0
Mixed-methods study maps downstream developers' concerns, practices, and challenges with AI failures in PTM-based software.

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
cs.LG 2023-05 accept novelty 6.0
FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.

Teaching Large Language Models to Self-Debug
cs.CL 2023-04 unverdicted novelty 6.0
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

A Technical Taxonomy of LLM Agent Communication Protocols
cs.MA 2026-06 unverdicted novelty 5.0
Creates a five-dimension taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) from nine protocols and identifies architectural patterns plus convergence trends.

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems
cs.CR 2026-05 unverdicted novelty 5.0
SkillVetBench is a two-stage benchmark combining natural-language semantic vetting and instrumented sandbox execution to detect and provide runtime evidence for malicious skills in open agent platforms, with experimen...

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
cs.LG 2026-05 unverdicted novelty 5.0
Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
cs.LG 2026-05 unverdicted novelty 5.0
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting
cs.HC 2026-04 unverdicted novelty 5.0
OOPrompt reifies user intents into structured manipulable artifacts to enable modular and iterative prompting in LLM-based interactive systems.

Fine-grained Multi-Document Extraction and Generation of Code Change Rationale
cs.SE 2026-04 conditional novelty 5.0
ARGUS extracts fragmented code change rationales from multiple documents using LLMs and generates summaries that developers rate as useful for review and maintenance.

Efficient Reasoning with Hidden Thinking
cs.CL 2025-01 unverdicted novelty 5.0
Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance
cs.AI 2026-05 unverdicted novelty 4.0
Proposes a modular agentic architecture for educational LLMs with stage-specific modules to incorporate pedagogical advice and improve controllability over monolithic chatbots.

Large Language Model-Brained GUI Agents: A Survey
cs.AI 2024-11 unverdicted novelty 4.0
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.