{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:XYRW52VEY4L236AVZCPU5VWAUR","short_pith_number":"pith:XYRW52VE","schema_version":"1.0","canonical_sha256":"be236eeaa4c717adf815c89f4ed6c0a4718b32bca204501fa8988d1d841daea7","source":{"kind":"arxiv","id":"2204.00598","version":2},"attestation_state":"computed","paper":{"title":"Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal capabilities without finetuning.","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Adrian Wong, Andy Zeng, Aveek Purohit, Brian Ichter, Federico Tombari, Johnny Lee, Krzysztof Choromanski, Maria Attarian, Michael Ryoo, Pete Florence, Stefan Welker, Vikas Sindhwani, Vincent Vanhoucke","submitted_at":"2022-04-01T17:43:13Z","abstract_excerpt":"Large pretrained (e.g., \"foundation\") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. 
In this work, we show that this diversity is symbiotic, and can be leveraged through Soc"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2204.00598","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2022-04-01T17:43:13Z","cross_cats_sorted":["cs.AI","cs.CL","cs.LG"],"title_canon_sha256":"6c6ab2fbb224c4add86a606e514200ffee80459270e2a6a79154866871bd8d58","abstract_canon_sha256":"577a8a3b32c6a40980cc97108847a1f07fcfe408cc83e3df4277ff16066621c8"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.286915Z","signature_b64":"vTmgq/o7/YhS3PIMj2XVRTwzOaZVRkbw+K6QjXkb+NjGJQYYJiA1VIFHNOwnEZYqW9fJzRhZA4Z0m/Ezy/21Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"be236eeaa4c717adf815c89f4ed6c0a4718b32bca204501fa8988d1d841daea7","last_reissued_at":"2026-05-17T23:38:48.286458Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.286458Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal capabilities without finetuning.","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Adrian Wong, Andy Zeng, Aveek Purohit, Brian Ichter, Federico Tombari, Johnny Lee, 
Krzysztof Choromanski, Maria Attarian, Michael Ryoo, Pete Florence, Stefan Welker, Vikas Sindhwani, Vincent Vanhoucke","submitted_at":"2022-04-01T17:43:13Z","abstract_excerpt":"Large pretrained (e.g., \"foundation\") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Soc"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That distinct capabilities stored in separately trained foundation models can be reliably accessed and combined through prompting alone, without finetuning or task-specific adaptation that would break the zero-shot property.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal 
capabilities without finetuning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2568b27e60b3567fee0fd364825f20cf2e7d8c6e20ed0c83fbecb876404b7e35"},"source":{"id":"2204.00598","kind":"arxiv","version":2},"verdict":{"id":"22b02b1a-d09a-484c-8fd7-279b4d5a7b37","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T09:46:44.327244Z","strongest_claim":"multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning","one_line_summary":"Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That distinct capabilities stored in separately trained foundation models can be reliably accessed and combined through prompting alone, without finetuning or task-specific adaptation that would break the zero-shot property.","pith_extraction_headline":"Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal capabilities without finetuning."},"references":{"count":142,"sample":[{"doi":"","year":2018,"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","ref_index":1,"cited_arxiv_id":"1810.04805","is_internal_anchor":true},{"doi":"","year":1901,"title":"T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information process","work_id":"c20b01b4-6073-488b-bbd5-523f804a51ca","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"A. Radford, J. W. Kim, C. 
Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In Internati","work_id":"4182da69-5550-4207-b042-4cb58eef7a03","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","ref_index":4,"cited_arxiv_id":"2108.07258","is_internal_anchor":true},{"doi":"","year":2021,"title":"J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processi","work_id":"bd7ee6a3-07e7-49d7-9422-5de7f546009d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":142,"snapshot_sha256":"63b17ca94639d0f62a5e8ab2865ae994d858e8156328706efb06b91d0e48cbf4","internal_anchors":17},"formal_canon":{"evidence_count":1,"snapshot_sha256":"d317bdf003f58f7983cb9b97552b3983b8acae9e7f777d4397c216f168f64197"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2204.00598","created_at":"2026-05-17T23:38:48.286538+00:00"},{"alias_kind":"arxiv_version","alias_value":"2204.00598v2","created_at":"2026-05-17T23:38:48.286538+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2204.00598","created_at":"2026-05-17T23:38:48.286538+00:00"},{"alias_kind":"pith_short_12","alias_value":"XYRW52VEY4L2","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"XYRW52VEY4L236AV","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"XYRW52VE","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":
26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2412.09176","citing_title":"LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21261","citing_title":"STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2506.04565","citing_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems","ref_index":223,"is_internal_anchor":true},{"citing_arxiv_id":"2509.16615","citing_title":"LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2511.12676","citing_title":"BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2310.14566","citing_title":"HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2205.14100","citing_title":"GIT: A Generative Image-to-text Transformer for Vision and Language","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2309.02427","citing_title":"Cognitive Architectures for Language Agents","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2309.16671","citing_title":"Demystifying CLIP Data","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2302.01560","citing_title":"Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":205,"is_internal_anchor":true},{"citing_arxiv_id":"2304.08244","citing_title":"API-Bank: A Comprehensive Benchmark for 
Tool-Augmented LLMs","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2205.01917","citing_title":"CoCa: Contrastive Captioners are Image-Text Foundation Models","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2209.07753","citing_title":"Code as Policies: Language Model Programs for Embodied Control","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2303.11381","citing_title":"MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2303.04671","citing_title":"Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2307.05973","citing_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2311.12983","citing_title":"GAIA: a benchmark for General AI Assistants","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2204.14198","citing_title":"Flamingo: a Visual Language Model for Few-Shot Learning","ref_index":145,"is_internal_anchor":true},{"citing_arxiv_id":"2305.14325","citing_title":"Improving Factuality and Reasoning in Language Models through Multiagent Debate","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2207.05608","citing_title":"Inner Monologue: Embodied Reasoning through Planning with Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02130","citing_title":"From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21718","citing_title":"Building a Precise Video Language with Human-AI Oversight","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12896","citing_title":"Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception 
Programs","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2206.07682","citing_title":"Emergent Abilities of Large Language Models","ref_index":99,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR","json":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR.json","graph_json":"https://pith.science/api/pith-number/XYRW52VEY4L236AVZCPU5VWAUR/graph.json","events_json":"https://pith.science/api/pith-number/XYRW52VEY4L236AVZCPU5VWAUR/events.json","paper":"https://pith.science/paper/XYRW52VE"},"agent_actions":{"view_html":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR","download_json":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR.json","view_paper":"https://pith.science/paper/XYRW52VE","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2204.00598&json=true","fetch_graph":"https://pith.science/api/pith-number/XYRW52VEY4L236AVZCPU5VWAUR/graph.json","fetch_events":"https://pith.science/api/pith-number/XYRW52VEY4L236AVZCPU5VWAUR/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR/action/timestamp_anchor","attest_storage":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR/action/storage_attestation","attest_author":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR/action/author_attestation","sign_citation":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR/action/citation_signature","submit_replication":"https://pith.science/pith/XYRW52VEY4L236AVZCPU5VWAUR/action/replication_record"}},"created_at":"2026-05-17T23:38:48.286538+00:00","updated_at":"2026-05-17T23:38:48.286538+00:00"}