Measuring and Narrowing the Compositionality Gap in Language Models
Pith reviewed 2026-05-17 17:45 UTC · model grok-4.3
The pith
Larger language models improve single-fact recall faster than they improve the ability to compose multiple facts into answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. Elicitive prompting such as chain of thought narrows the compositionality gap by reasoning explicitly, and the new self-ask method further improves on it by having the model explicitly ask itself and answer follow-up questions before answering the initial question.
What carries the argument
The compositionality gap, the ratio of cases where a model answers all sub-problems correctly yet fails to produce the correct overall solution to a multi-hop question.
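A minimal sketch of how this ratio could be computed from graded outputs. The `Record` schema and the function name are hypothetical, and the sketch assumes the denominator is the set of questions whose sub-questions were all answered correctly; the summary above does not pin the denominator down, so treat that choice as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One multi-hop question with graded outcomes (hypothetical schema)."""
    sub_correct: list[bool]  # one flag per sub-question
    full_correct: bool       # whether the composed answer was right


def compositionality_gap(records: list[Record]) -> float:
    """Share of questions with all sub-answers right whose full answer is wrong.

    Assumption: only questions whose sub-questions were all answered
    correctly enter the denominator, not the whole dataset.
    """
    eligible = [r for r in records if all(r.sub_correct)]
    if not eligible:
        raise ValueError("no question has all sub-questions answered correctly")
    failures = sum(1 for r in eligible if not r.full_correct)
    return failures / len(eligible)


# Toy data for illustration only; these are not results from the paper.
records = [
    Record([True, True], True),
    Record([True, True], False),
    Record([True, False], False),  # excluded: one sub-question was wrong
    Record([True, True], False),
]
print(compositionality_gap(records))
```

In the toy data, three questions are eligible and two of them fail at the composition step, so the gap is two thirds.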
If this is right
- Pure scaling of next-token prediction models will not automatically close the compositionality gap.
- Explicit step-by-step prompting can narrow the gap without any change to model weights or training data.
- Structured prompting like self-ask makes it simple to insert external tools such as search engines into the reasoning chain.
- Accuracy on multi-hop questions can be improved by separating the generation of intermediate questions from the final answer.
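The last two points can be sketched as a small control loop in which the model writes each follow-up question and a separate callable supplies the intermediate answer. This is an illustrative sketch, not the authors' implementation: the marker strings, `self_ask`, `scripted_model`, and `lookup_tool` are all assumptions, and the scripted model and toy fact table stand in for a real language model and a real search engine.

```python
from typing import Callable, Optional

# Marker strings for the scaffold; the exact wording is an assumption.
FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def self_ask(
    question: str,
    generate: Callable[[str], str],
    answer_followup: Callable[[str], str],
    max_hops: int = 4,
) -> Optional[str]:
    """Alternate between model-written follow-ups and externally supplied answers."""
    prompt = f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_hops):
        step = generate(prompt).strip()
        if FINAL in step:
            return step.split(FINAL, 1)[1].strip()
        if FOLLOW_UP not in step:
            return None  # the model left the scaffold; give up
        sub_question = step.split(FOLLOW_UP, 1)[1].strip()
        # The intermediate answer comes from the tool, not from the model.
        prompt += f" {step}\n{INTERMEDIATE} {answer_followup(sub_question)}\n"
    return None  # hop budget exhausted without a final answer


# Toy stand-ins so the sketch runs without a model or a search API.
TOY_FACTS = {
    "Who directed the film?": "Director A",
    "Where was Director A born?": "City B",
}

SCRIPT = [
    "Yes.\nFollow up: Who directed the film?",
    "Follow up: Where was Director A born?",
    "So the final answer is: City B",
]


def scripted_model(prompt: str) -> str:
    """Replay the next scripted step, keyed on how many hops the prompt holds."""
    return SCRIPT[prompt.count(INTERMEDIATE)]


def lookup_tool(sub_question: str) -> str:
    return TOY_FACTS[sub_question]


print(self_ask("Where was the director of the film born?", scripted_model, lookup_tool))
```

Because the intermediate answer is produced by `answer_followup`, swapping the toy lookup for a search call changes one argument and leaves the loop untouched, which is the property the structured format is credited with.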
Where Pith is reading between the lines
- The persistent gap suggests that next-token training may reward memorization of surface patterns more than the internal construction of composed answers.
- Self-ask could be tested on other multi-step domains such as arithmetic word problems or logical deduction to see if the same narrowing effect appears.
- If the gap remains even in much larger models, then new pretraining objectives that explicitly reward intermediate reasoning steps may be needed.
Load-bearing premise
The multi-hop questions are built from facts unlikely to have been observed together during pretraining, forcing the model to compose rather than recall the full answer directly.
What would settle it
Running the scaling experiment on a new set of multi-hop questions where the component facts are known to co-occur frequently in the pretraining corpus and checking whether the gap shrinks or disappears.
Original abstract
We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper measures the compositionality gap in language models—the rate at which models correctly answer sub-questions but fail on the full multi-hop question requiring composition of those facts. Using multi-hop questions built from facts unlikely to co-occur in pretraining, the authors report that in the GPT-3 family single-hop QA accuracy scales faster with model size than multi-hop accuracy, so the gap does not shrink. They introduce self-ask prompting (model generates and answers its own follow-up questions) which narrows the gap, and show further gains when self-ask is combined with external search.
Significance. If the measurement is robust, the result that compositionality does not improve with scale (while factual recall does) is a useful empirical finding for understanding LM limitations. The self-ask method and its search-engine extension provide a concrete, reproducible prompting technique that improves multi-hop performance. The work is strongest in its clear single-hop vs. multi-hop comparison and the practical elicitation method; it would be strengthened by tighter controls on the memorization assumption.
major comments (2)
- [Abstract and question-generation description] The central scaling claim (single-hop improves faster than multi-hop, so the compositionality gap does not decrease) rests on the assumption that correct multi-hop answers require composition rather than retrieval of a pre-seen joint fact. The manuscript should add explicit controls or analysis (e.g., n-gram overlap checks, paraphrase tests, or training-data co-occurrence statistics for the constructed multi-hop items) to substantiate that the facts are unlikely to have been observed together; without this, slower multi-hop scaling could reflect prompt length, retrieval difficulty, or surface differences instead of a compositionality limit.
- [Results on prompting variants] The paper reports that self-ask further improves on chain-of-thought, but the results section should include an ablation isolating the contribution of the explicit follow-up question generation versus simply lengthening the prompt or adding more reasoning steps; this is needed to confirm that the structured self-asking mechanism is the operative factor.
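One of the controls named in the first comment, an n-gram overlap check, could look roughly like the sketch below. It compares question strings only, so it says nothing about co-occurrence in the pretraining corpus; the function names, the tokenizer, and the choice of Jaccard similarity over bigrams are assumptions made for illustration.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both strings are shorter than n tokens
    return len(grams_a & grams_b) / len(union)


# Toy strings for illustration only.
sub_question = "Who directed the film?"
full_question = "Where was the director of the film born?"
print(ngram_jaccard(sub_question, full_question))
```

Here the two strings share a single bigram ("the film") out of nine distinct bigrams, so the score is low even though the questions are closely related in meaning, which is one reason a lexical check is only a partial control.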
minor comments (2)
- [Introduction / Methods] Clarify the exact definition and formula for the compositionality gap (P(sub-questions correct and full answer wrong)) in the main text, including how ties or partial credit are handled.
- [Scaling experiments] Add error bars or statistical significance tests for the scaling trends across GPT-3 sizes to support the claim that single-hop improves faster than multi-hop.
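The error bars requested in the last comment could be obtained in several ways; a percentile bootstrap over per-question outcomes is one simple option, sketched here. The function name, the defaults, and the toy outcomes are assumptions, and nothing below reproduces numbers from the paper.

```python
import random


def bootstrap_ci(
    outcomes: list[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for an accuracy."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample questions with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Toy outcomes for illustration only: 60 correct answers out of 100.
toy_outcomes = [True] * 60 + [False] * 40
low, high = bootstrap_ci(toy_outcomes)
print(f"accuracy 0.60, 95% interval [{low:.2f}, {high:.2f}]")
```

Running the same procedure per model size for single-hop and multi-hop accuracy would show whether the intervals around the two scaling trends actually separate.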
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper where appropriate to strengthen the presentation of our results.
point-by-point responses
- Referee: [Abstract and question-generation description] The central scaling claim (single-hop improves faster than multi-hop, so the compositionality gap does not decrease) rests on the assumption that correct multi-hop answers require composition rather than retrieval of a pre-seen joint fact. The manuscript should add explicit controls or analysis (e.g., n-gram overlap checks, paraphrase tests, or training-data co-occurrence statistics for the constructed multi-hop items) to substantiate that the facts are unlikely to have been observed together; without this, slower multi-hop scaling could reflect prompt length, retrieval difficulty, or surface differences instead of a compositionality limit.
  Authors: We thank the referee for this suggestion, which helps clarify the interpretation of our scaling results. In the revised manuscript we have added an n-gram overlap analysis showing minimal lexical overlap between sub-questions and the full multi-hop questions. We have also included paraphrase robustness checks in which we reworded the multi-hop questions and observed that the compositionality gap persists. These additions are now reported in the question-generation and results sections. We note, however, that direct co-occurrence statistics from the GPT-3 pretraining corpus cannot be computed because that data is proprietary and inaccessible to us; our construction procedure instead selects facts from semantically distant domains to reduce the chance of joint observation.
  revision: partial
- Referee: [Results on prompting variants] The paper reports that self-ask further improves on chain-of-thought, but the results section should include an ablation isolating the contribution of the explicit follow-up question generation versus simply lengthening the prompt or adding more reasoning steps; this is needed to confirm that the structured self-asking mechanism is the operative factor.
  Authors: We agree that an ablation isolating the structured question-generation component is useful. In the revised results section we now report an additional control experiment that matches self-ask for prompt length and number of reasoning steps but omits the explicit follow-up question generation step. The structured self-ask variant continues to outperform this length-and-step-matched baseline, supporting that the explicit question-asking mechanism contributes beyond mere prompt expansion. These new results and accompanying discussion have been added to the paper.
  revision: yes
- Direct computation of training-data co-occurrence statistics for the GPT-3 family, because the pretraining corpus is not publicly released.
Circularity Check
No significant circularity in empirical measurement paper
full rationale
The paper is an empirical study that defines the compositionality gap operationally as the rate at which models answer sub-questions correctly but fail on the full multi-hop question, then reports direct measurements of this gap across GPT-3 model sizes on a constructed dataset. No mathematical derivations, parameter fits, or self-referential equations appear in the provided text or abstract. The central scaling observation (single-hop accuracy improving faster than multi-hop) is presented as an experimental result rather than a derived claim that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that would force the reported outcome. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multi-hop questions can be constructed from facts unlikely to have co-occurred in pretraining data.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.InevitabilityRCL_is_unique_functional_form_of_logic
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
  DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
- Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
  TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
- MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
  MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
- Group-in-Group Policy Optimization for LLM Agent Training
  GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
- Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
  Plan-and-Solve prompting improves zero-shot LLM reasoning by first creating an explicit plan then executing subtasks, outperforming simple 'think step by step' prompts across ten datasets.
- RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
  RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
- OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
  OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
- How Do Language Models Compose Functions?
  LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.
- ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
  ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
  ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
- ToolRL: Reward is All Tool Learning Needs
  A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
  R1-Searcher uses two-stage outcome-based RL to train LLMs to invoke external search systems for better reasoning without process rewards or distillation.
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
- Chain-of-Verification Reduces Hallucination in Large Language Models
  Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
- ChemCrow: Augmenting large-language models with chemistry tools
  ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
- Language Models can Solve Computer Tasks
  Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
- ART: Automatic multi-step reasoning and tool-use for large language models
  ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
- AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
  AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
- EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
  EvolveR proposes a closed-loop self-evolution system for LLM agents that distills experiences into principles offline and applies reinforcement during online task interactions to achieve better performance on multi-ho...
- Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
  ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Self-Refine: Iterative Refinement with Self-Feedback
  Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
- Agentic Reasoning for Large Language Models
  The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
- Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
  GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
- A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
  MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.