{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:DJHSBN54HU2VV2B7SIQRIVL5FE","short_pith_number":"pith:DJHSBN54","schema_version":"1.0","canonical_sha256":"1a4f20b7bc3d355ae83f922114557d291eb3fffea673138ac709f19448d57925","source":{"kind":"arxiv","id":"2210.03350","version":3},"attestation_state":"computed","paper":{"title":"Measuring and Narrowing the Compositionality Gap in Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Larger language models improve single-fact recall faster than they improve the ability to compose multiple facts into answers.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Ludwig Schmidt, Mike Lewis, Muru Zhang, Noah A. Smith, Ofir Press, Sewon Min","submitted_at":"2022-10-07T06:50:23Z","abstract_excerpt":"We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance i"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2210.03350","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2022-10-07T06:50:23Z","cross_cats_sorted":[],"title_canon_sha256":"b51c2e7d513e3bc51f957d92064bdf2b30f25b753209fbaa8ff4a6266f98bb0d","abstract_canon_sha256":"b6f926ce0d2d9a6fb0798b7b877f894c9f7ab05f8cabc5a4854025fc477756e1"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.480592Z","signature_b64":"7rQMLq/TcwYc9YaPhXeiAFoSi5v8pEFuvikC4mS9G3s/Ij7Bcl9VWlMXt+hJEeBzCq65pP/mBo2sCj11/EcmCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"1a4f20b7bc3d355ae83f922114557d291eb3fffea673138ac709f19448d57925","last_reissued_at":"2026-05-17T23:38:13.479942Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.479942Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Measuring and Narrowing the Compositionality Gap in Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Larger language models improve single-fact recall faster than they improve the ability to compose multiple facts into answers.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Ludwig Schmidt, Mike Lewis, Muru Zhang, Noah A. Smith, Ofir Press, Sewon Min","submitted_at":"2022-10-07T06:50:23Z","abstract_excerpt":"We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance i"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the multi-hop questions are built from facts unlikely to have been observed together during pretraining, so that correct answers to the full question must come from composition rather than direct memorization of the combined fact.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Larger language models improve faster at single facts than at composing them, but self-ask prompting reduces the compositionality gap by forcing explicit intermediate questions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Larger language models improve single-fact recall faster than they improve the ability to compose multiple facts into answers.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"76af897bafc5529bad8fa162060531c25baf0390f357b741305838bb6522eae3"},"source":{"id":"2210.03350","kind":"arxiv","version":3},"verdict":{"id":"c785ee43-207b-4c8f-b685-1dccaf41523a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T17:45:40.663909Z","strongest_claim":"In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease.","one_line_summary":"Larger language models improve faster at single facts than at composing them, but self-ask prompting reduces the compositionality gap by forcing explicit intermediate questions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the multi-hop questions are built from facts unlikely to have been observed together during pretraining, so that correct answers to the full question must come from composition rather than direct memorization of the combined fact.","pith_extraction_headline":"Larger language models improve single-fact recall faster than they improve the ability to compose multiple facts into answers."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"6e82f68289ec3891e4b7f18d1b5f02a210155d166ef9bfd2cd6426aee7c33f25"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2210.03350","created_at":"2026-05-17T23:38:13.480034+00:00"},{"alias_kind":"arxiv_version","alias_value":"2210.03350v3","created_at":"2026-05-17T23:38:13.480034+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2210.03350","created_at":"2026-05-17T23:38:13.480034+00:00"},{"alias_kind":"pith_short_12","alias_value":"DJHSBN54HU2V","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"DJHSBN54HU2VV2B7","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"DJHSBN54","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":31,"internal_anchor_count":31,"sample":[{"citing_arxiv_id":"2605.22905","citing_title":"EVE-Agent: Evidence-Verifiable Self-Evolving Agents","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2211.09110","citing_title":"Holistic Evaluation of Language Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2505.04588","citing_title":"ZeroSearch: Incentivize the Search Capability of LLMs without Searching","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23108","citing_title":"Artificial Phantasia: Emergent Mental Imagery in Large Language Models","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2510.16079","citing_title":"EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16675","citing_title":"LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00568","citing_title":"ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00861","citing_title":"Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2510.01685","citing_title":"How Do Language Models Compose Functions?","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2510.16079","citing_title":"EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00066","citing_title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2309.11495","citing_title":"Chain-of-Verification Reduces Hallucination in Large Language Models","ref_index":163,"is_internal_anchor":true},{"citing_arxiv_id":"2511.02805","citing_title":"MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2511.09803","citing_title":"Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2505.04588","citing_title":"ZeroSearch: Incentivize the Search Capability of LLMs without Searching","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2601.12538","citing_title":"Agentic Reasoning for Large Language Models","ref_index":255,"is_internal_anchor":true},{"citing_arxiv_id":"2303.17491","citing_title":"Language Models can Solve Computer Tasks","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2303.09014","citing_title":"ART: Automatic multi-step reasoning and tool-use for large language models","ref_index":148,"is_internal_anchor":true},{"citing_arxiv_id":"2305.04091","citing_title":"Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2304.05376","citing_title":"ChemCrow: Augmenting large-language models with chemistry tools","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2504.13958","citing_title":"ToolRL: Reward is All Tool Learning Needs","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2503.05592","citing_title":"R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03675","citing_title":"OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":259,"is_internal_anchor":true},{"citing_arxiv_id":"2310.11511","citing_title":"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection","ref_index":86,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE","json":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE.json","graph_json":"https://pith.science/api/pith-number/DJHSBN54HU2VV2B7SIQRIVL5FE/graph.json","events_json":"https://pith.science/api/pith-number/DJHSBN54HU2VV2B7SIQRIVL5FE/events.json","paper":"https://pith.science/paper/DJHSBN54"},"agent_actions":{"view_html":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE","download_json":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE.json","view_paper":"https://pith.science/paper/DJHSBN54","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2210.03350&json=true","fetch_graph":"https://pith.science/api/pith-number/DJHSBN54HU2VV2B7SIQRIVL5FE/graph.json","fetch_events":"https://pith.science/api/pith-number/DJHSBN54HU2VV2B7SIQRIVL5FE/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE/action/timestamp_anchor","attest_storage":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE/action/storage_attestation","attest_author":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE/action/author_attestation","sign_citation":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE/action/citation_signature","submit_replication":"https://pith.science/pith/DJHSBN54HU2VV2B7SIQRIVL5FE/action/replication_record"}},"created_at":"2026-05-17T23:38:13.480034+00:00","updated_at":"2026-05-17T23:38:13.480034+00:00"}