arxiv: 2210.03350 · v3 · pith:DJHSBN54new · submitted 2022-10-07 · 💻 cs.CL

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis

Pith reviewed 2026-05-17 17:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords compositionality gap, multi-hop question answering, language models, prompting methods, chain of thought, self-ask, GPT-3, compositional reasoning

The pith

Larger language models improve single-fact recall faster than they improve the ability to compose multiple facts into answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures a compositionality gap by testing how often models correctly solve the individual facts in a question but fail when those facts must be combined into one overall answer. Using multi-hop questions built from facts unlikely to appear together in training data, the authors track this gap across the GPT-3 model family. They find that single-hop accuracy rises with scale while multi-hop accuracy lags, so the gap does not shrink. The work then shows that prompting the model to reason explicitly, such as with chain-of-thought, reduces the gap, and introduces a new self-ask method that improves further by generating and answering its own follow-up questions before the final answer. Self-ask also makes it straightforward to attach an external search engine to answer those follow-ups and raise accuracy more.
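The self-ask procedure described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `generate` and `search` are hypothetical callables, the scripted model and the lookup table are stand-ins for a real language model and search engine, and the scaffold strings (`Follow up:`, `Intermediate answer:`, `So the final answer is:`) are assumed transcript markers rather than a confirmed copy of the paper's prompt.

```python
from typing import Callable, Optional

# Assumed scaffold markers for the transcript (illustrative, not confirmed).
FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def self_ask(question: str,
             generate: Callable[[str], str],
             search: Optional[Callable[[str], str]] = None,
             max_hops: int = 4) -> str:
    """Run a self-ask style loop and return the final answer.

    generate(prompt) is a hypothetical LM call returning the next transcript
    line. If search is given, it answers each follow-up question and its
    result is appended as the intermediate answer.
    """
    transcript = (f"Question: {question}\n"
                  "Are follow up questions needed here: Yes.\n")
    for _ in range(max_hops * 2 + 1):
        line = generate(transcript).strip()
        if line.startswith(FINAL):
            return line[len(FINAL):].strip()
        transcript += line + "\n"
        if line.startswith(FOLLOW_UP) and search is not None:
            sub_question = line[len(FOLLOW_UP):].strip()
            transcript += f"{INTERMEDIATE} {search(sub_question)}\n"
    raise RuntimeError("no final answer produced within the hop budget")


def scripted_lm(prompt: str) -> str:
    """Stand-in for a language model: emits a fixed script of lines."""
    answered = prompt.count(INTERMEDIATE)
    if answered == 0:
        return "Follow up: When was superconductivity discovered?"
    if answered == 1:
        return "Follow up: Who was president of the U.S. in 1911?"
    return "So the final answer is: William Howard Taft"


# Stand-in for a search engine: a tiny lookup table.
facts = {
    "When was superconductivity discovered?": "1911",
    "Who was president of the U.S. in 1911?": "William Howard Taft",
}

answer = self_ask(
    "Who was president of the U.S. when superconductivity was discovered?",
    scripted_lm,
    search=lambda q: facts[q],
)
print(answer)
```

With a real model, the generation call would need a stop condition so that it halts after each follow-up line; that pause is what allows an external answer to be spliced into the transcript before generation resumes.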

Core claim

In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does; therefore, the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. Elicitive prompting such as chain of thought narrows the compositionality gap by reasoning explicitly, and the new self-ask method further improves on it by having the model explicitly ask itself and answer follow-up questions before answering the initial question.

What carries the argument

The compositionality gap: the ratio of cases in which a model answers all sub-problems correctly yet fails to produce the correct overall solution to a multi-hop question.
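One way to make this ratio concrete is sketched below. It assumes a conditional reading of the definition: among the questions whose sub-questions were all answered correctly, the gap is the fraction whose multi-hop answer was wrong. The `Record` schema and the sample records are hypothetical and exist only to illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Record:
    """Graded outcome for one multi-hop question (hypothetical schema)."""
    sub_correct: tuple[bool, ...]  # one flag per sub-question
    full_correct: bool             # was the multi-hop answer right?


def compositionality_gap(records: Iterable[Record]) -> float:
    """Fraction of wrong multi-hop answers among questions whose
    sub-questions were all answered correctly (assumed denominator)."""
    eligible = [r for r in records if all(r.sub_correct)]
    if not eligible:
        raise ValueError("no question has all sub-questions correct")
    failures = sum(1 for r in eligible if not r.full_correct)
    return failures / len(eligible)


# Made-up grading outcomes, for illustration only.
records = [
    Record((True, True), True),    # composed correctly
    Record((True, True), False),   # knew both facts, failed to compose
    Record((True, False), False),  # excluded: a sub-question was wrong
    Record((True, True), False),   # knew both facts, failed to compose
]

gap = compositionality_gap(records)
print(f"compositionality gap = {gap:.3f}")
```

Conditioning on correct sub-answers is what separates a composition failure from missing knowledge: the third record is left out of the denominator because the model never had both facts.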

If this is right

Pure scaling of next-token prediction models will not automatically close the compositionality gap.
Explicit step-by-step prompting can narrow the gap without any change to model weights or training data.
Structured prompting like self-ask makes it simple to insert external tools such as search engines into the reasoning chain.
Accuracy on multi-hop questions can be improved by separating the generation of intermediate questions from the final answer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The persistent gap suggests that next-token training may reward memorization of surface patterns more than the internal construction of composed answers.
Self-ask could be tested on other multi-step domains such as arithmetic word problems or logical deduction to see if the same narrowing effect appears.
If the gap remains even in much larger models, then new pretraining objectives that explicitly reward intermediate reasoning steps may be needed.

Load-bearing premise

The multi-hop questions are built from facts unlikely to have been observed together during pretraining, forcing the model to compose rather than recall the full answer directly.

What would settle it

Running the scaling experiment on a new set of multi-hop questions where the component facts are known to co-occur frequently in the pretraining corpus and checking whether the gap shrinks or disappears.

Original abstract

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Scaling boosts single-hop accuracy faster than multi-hop in GPT-3, leaving the compositionality gap intact, while self-ask prompting narrows it by forcing explicit sub-questions.

The main thing to know is that this paper measures a persistent compositionality gap in language models: larger GPT-3 variants improve on single facts more quickly than on questions that require stitching two facts together, so the gap does not shrink with scale. The authors also introduce self-ask, a prompting approach where the model first generates and answers its own follow-up questions before tackling the original one, and show it beats chain-of-thought, especially when a search tool can be attached to the sub-questions.

Referee Report

2 major / 2 minor

Summary. The paper measures the compositionality gap in language models: the rate at which models correctly answer sub-questions but fail on the full multi-hop question requiring composition of those facts. Using multi-hop questions built from facts unlikely to co-occur in pretraining, the authors report that in the GPT-3 family single-hop QA accuracy scales faster with model size than multi-hop accuracy, so the gap does not shrink. They introduce self-ask prompting (the model generates and answers its own follow-up questions), which narrows the gap, and show further gains when self-ask is combined with external search.

Significance. If the measurement is robust, the result that compositionality does not improve with scale (while factual recall does) is a useful empirical finding for understanding LM limitations. The self-ask method and its search-engine extension provide a concrete, reproducible prompting technique that improves multi-hop performance. The work is strongest in its clear single-hop vs. multi-hop comparison and the practical elicitation method; it would be strengthened by tighter controls on the memorization assumption.

major comments (2)

[Abstract and question-generation description] The central scaling claim (single-hop improves faster than multi-hop, so the compositionality gap does not decrease) rests on the assumption that correct multi-hop answers require composition rather than retrieval of a pre-seen joint fact. The manuscript should add explicit controls or analysis (e.g., n-gram overlap checks, paraphrase tests, or training-data co-occurrence statistics for the constructed multi-hop items) to substantiate that the facts are unlikely to have been observed together; without this, slower multi-hop scaling could reflect prompt length, retrieval difficulty, or surface differences instead of a compositionality limit.
[Results on prompting variants] The paper reports that self-ask further improves on chain-of-thought, but the results section should include an ablation isolating the contribution of the explicit follow-up question generation versus simply lengthening the prompt or adding more reasoning steps; this is needed to confirm that the structured self-asking mechanism is the operative factor.

minor comments (2)

[Introduction / Methods] Clarify the exact definition and formula for the compositionality gap (P(sub-questions correct and full answer wrong)) in the main text, including how ties or partial credit are handled.
[Scaling experiments] Add error bars or statistical significance tests for the scaling trends across GPT-3 sizes to support the claim that single-hop improves faster than multi-hop.
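The error bars requested in the last comment could be obtained in several ways; a percentile bootstrap over per-question outcomes is one common choice. The sketch below is an assumption about how such a check might look, not a procedure taken from the paper, and the outcome list is synthetic.

```python
import random
from typing import Sequence


def bootstrap_ci(outcomes: Sequence[bool],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for an accuracy, i.e. the mean of
    per-question correct/incorrect outcomes."""
    if len(outcomes) == 0:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample questions with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Synthetic outcomes for illustration: 30 correct answers out of 100.
outcomes = [True] * 30 + [False] * 70
lower, upper = bootstrap_ci(outcomes)
print(f"accuracy = 0.30, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Running this once per model size for both single-hop and multi-hop accuracy would show whether the intervals of the two trends separate as scale increases.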

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper where appropriate to strengthen the presentation of our results.

Referee: [Abstract and question-generation description] The central scaling claim (single-hop improves faster than multi-hop, so the compositionality gap does not decrease) rests on the assumption that correct multi-hop answers require composition rather than retrieval of a pre-seen joint fact. The manuscript should add explicit controls or analysis (e.g., n-gram overlap checks, paraphrase tests, or training-data co-occurrence statistics for the constructed multi-hop items) to substantiate that the facts are unlikely to have been observed together; without this, slower multi-hop scaling could reflect prompt length, retrieval difficulty, or surface differences instead of a compositionality limit.

Authors: We thank the referee for this suggestion, which helps clarify the interpretation of our scaling results. In the revised manuscript we have added an n-gram overlap analysis showing minimal lexical overlap between sub-questions and the full multi-hop questions. We have also included paraphrase robustness checks in which we reworded the multi-hop questions and observed that the compositionality gap persists. These additions are now reported in the question-generation and results sections. We note, however, that direct co-occurrence statistics from the GPT-3 pretraining corpus cannot be computed because that data is proprietary and inaccessible to us; our construction procedure instead selects facts from semantically distant domains to reduce the chance of joint observation.

revision: partial

Referee: [Results on prompting variants] The paper reports that self-ask further improves on chain-of-thought, but the results section should include an ablation isolating the contribution of the explicit follow-up question generation versus simply lengthening the prompt or adding more reasoning steps; this is needed to confirm that the structured self-asking mechanism is the operative factor.

Authors: We agree that an ablation isolating the structured question-generation component is useful. In the revised results section we now report an additional control experiment that matches self-ask for prompt length and number of reasoning steps but omits the explicit follow-up question generation step. The structured self-ask variant continues to outperform this length-and-step-matched baseline, supporting that the explicit question-asking mechanism contributes beyond mere prompt expansion. These new results and accompanying discussion have been added to the paper.

revision: yes
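An n-gram overlap check of the kind mentioned in the first response could look like the sketch below. It is an assumption about what such an analysis might compute (Jaccard overlap of word n-gram sets), not the analysis from the simulated rebuttal or the paper, and the two example questions are illustrative.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


sub_question = "When was superconductivity discovered?"
multi_hop = ("Who was president of the U.S. when superconductivity "
             "was discovered?")

unigram_overlap = overlap(sub_question, multi_hop, n=1)
bigram_overlap = overlap(sub_question, multi_hop, n=2)
print(unigram_overlap, bigram_overlap)
```

Lexical overlap between a question and its sub-questions is only a surface proxy. It says nothing about whether the underlying facts co-occurred in the pretraining corpus, which is the objection that remains unresolved.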

standing simulated objections not resolved

Direct computation of training-data co-occurrence statistics for the GPT-3 family, because the pretraining corpus is not publicly released.

Circularity Check

0 steps flagged

No significant circularity in empirical measurement paper

The paper is an empirical study that defines the compositionality gap operationally as the rate at which models answer sub-questions correctly but fail on the full multi-hop question, then reports direct measurements of this gap across GPT-3 model sizes on a constructed dataset. No mathematical derivations, parameter fits, or self-referential equations appear in the provided text or abstract. The central scaling observation (single-hop accuracy improving faster than multi-hop) is presented as an experimental result rather than a derived claim that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that would force the reported outcome. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen multi-hop questions test genuine composition rather than memorized co-occurrences.

axioms (1)

domain assumption: Multi-hop questions can be constructed from facts unlikely to have co-occurred in pretraining data.
This assumption is required to interpret failures on the full question as evidence of a compositionality gap rather than missing knowledge.

pith-pipeline@v0.9.0 · 5529 in / 1328 out tokens · 59976 ms · 2026-05-17T17:45:40.663909+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability RCL_is_unique_functional_form_of_logic

Relation between the paper passage and the cited Recognition theorem: unclear

We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
cs.CL 2025-11 conditional novelty 7.0

TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
cs.CL 2023-05 conditional novelty 7.0

Plan-and-Solve prompting improves zero-shot LLM reasoning by first creating an explicit plan then executing subtasks, outperforming simple 'think step by step' prompts across ten datasets.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
cs.CV 2026-05 unverdicted novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI 2026-04 unverdicted novelty 6.0

OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
How Do Language Models Compose Functions?
cs.CL 2025-10 conditional novelty 6.0

LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
cs.CL 2025-10 unverdicted novelty 6.0

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 conditional novelty 6.0

ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
ToolRL: Reward is All Tool Learning Needs
cs.LG 2025-04 conditional novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
cs.AI 2025-03 unverdicted novelty 6.0

R1-Searcher uses two-stage outcome-based RL to train LLMs to invoke external search systems for better reasoning without process rewards or distillation.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
cs.CL 2023-10 unverdicted novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
Chain-of-Verification Reduces Hallucination in Large Language Models
cs.CL 2023-09 unverdicted novelty 6.0

Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
ChemCrow: Augmenting large-language models with chemistry tools
physics.chem-ph 2023-04 conditional novelty 6.0

ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
Language Models can Solve Computer Tasks
cs.CL 2023-03 accept novelty 6.0

Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
cs.AI 2026-05 unverdicted novelty 5.0

AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
cs.CL 2025-10 unverdicted novelty 5.0

EvolveR proposes a closed-loop self-evolution system for LLM agents that distills experiences into principles offline and applies reinforcement during online task interactions to achieve better performance on multi-ho...
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
cs.CL 2025-10 unverdicted novelty 5.0

ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Self-Refine: Iterative Refinement with Self-Feedback
cs.CL 2023-03 unverdicted novelty 5.0

Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
cs.LG 2025-10 unverdicted novelty 4.0

GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
cs.IR 2026-04 unverdicted novelty 3.0

MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.