{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:ZOHLDRDHWC72JFJ6655JF6NPE5","short_pith_number":"pith:ZOHLDRDH","schema_version":"1.0","canonical_sha256":"cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3","source":{"kind":"arxiv","id":"2302.14045","version":2},"attestation_state":"computed","paper":{"title":"Language Is Not All You Need: Aligning Perception with Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Barun Patra, Furu Wei, Johan Bjorck, Kriti Aggarwal, Lei Cui, Li Dong, Owais Khan Mohammed, Qiang Liu, Saksham Singhal, Shaohan Huang, Shuming Ma, Subhojit Som, Tengchao Lv, Vishrav Chaudhary, Wenhui Wang, Xia Song, Yaru Hao, Zewen Chi","submitted_at":"2023-02-27T18:55:27Z","abstract_excerpt":"A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. 
We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2302.14045","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-02-27T18:55:27Z","cross_cats_sorted":["cs.CV"],"title_canon_sha256":"24cf17c4ee445514c2840626d8355b45ecf0094c3de6a6b61d7d413ce01507b1","abstract_canon_sha256":"7993a0954152dcd545e045cc24f137dd82711ecda0cf6977f227365be35946f8"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.584551Z","signature_b64":"yzdn9gyowe/jzc0DaoMNI5S/y1p8rGvq9L8vMdIIu3FSTWELdzoncP/ZJNX540F8/jGZmv1b/zoRpINlY1TgAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"cb8eb1c467b0bfa4953ef77a92f9af2754c2cd0d73810d6e90d8cdb6db6d9aa3","last_reissued_at":"2026-05-17T23:38:50.584005Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.584005Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Language Is Not All You Need: Aligning Perception with Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Barun Patra, Furu Wei, Johan Bjorck, Kriti Aggarwal, Lei Cui, Li Dong, 
Owais Khan Mohammed, Qiang Liu, Saksham Singhal, Shaohan Huang, Shuming Ma, Subhojit Som, Tengchao Lv, Vishrav Chaudhary, Wenhui Wang, Xia Song, Yaru Hao, Zewen Chi","submitted_at":"2023-02-27T18:55:27Z","abstract_excerpt":"A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Kosmos-1 achieves impressive performance on language understanding, generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and vision tasks such as image recognition with descriptions, all without gradient updates or finetuning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That web-scale multimodal corpora provide sufficient aligned signal for the model to acquire general cross-modal capabilities that transfer to held-out tasks without any task-specific adaptation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Kosmos-1 learns perception and 
language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9742f48c92925033de4c965a4daa1bb558f625faf309c35c3116273eb5d73350"},"source":{"id":"2302.14045","kind":"arxiv","version":2},"verdict":{"id":"6456e510-b8bf-4a30-a815-3f51c57ef90d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T18:28:25.888757Z","strongest_claim":"Kosmos-1 achieves impressive performance on language understanding, generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and vision tasks such as image recognition with descriptions, all without gradient updates or finetuning.","one_line_summary":"Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That web-scale multimodal corpora provide sufficient aligned signal for the model to acquire general cross-modal capabilities that transfer to held-out tasks without any task-specific adaptation.","pith_extraction_headline":"Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning."},"references":{"count":33,"sample":[{"doi":"","year":null,"title":"Cm3: A causal masked multimodal model of the internet","work_id":"a4a6d3b6-13f5-437f-8081-765dd23198b9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Are Elephants Bigger than Butterflies? 
Reasoning about Sizes of Objects","work_id":"cda64168-85ba-41dd-8410-4f6f43755aa5","ref_index":2,"cited_arxiv_id":"1602.00753","is_internal_anchor":true},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"85340504-8b97-4876-b757-de68441ee4ff","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"BoolQ: Exploring the surprising difficulty of natural yes/no questions","work_id":"f9c11f6d-5c94-42d0-bfee-9f3a6d3a703b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","ref_index":5,"cited_arxiv_id":"2204.02311","is_internal_anchor":true}],"resolved_work":33,"snapshot_sha256":"f9d0d15096905a2cd5ff5193521c1b2423b8f85966338566bc90a16d6d37d567","internal_anchors":13},"formal_canon":{"evidence_count":3,"snapshot_sha256":"5e34ab46d29ada74852b360479b3bf59ccdd895a48bfe1e92e58c5d8f85c4713"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2302.14045","created_at":"2026-05-17T23:38:50.584105+00:00"},{"alias_kind":"arxiv_version","alias_value":"2302.14045v2","created_at":"2026-05-17T23:38:50.584105+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2302.14045","created_at":"2026-05-17T23:38:50.584105+00:00"},{"alias_kind":"pith_short_12","alias_value":"ZOHLDRDHWC72","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ZOHLDRDHWC72JFJ6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ZOHLDRDH","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2505.17015","citing_title":"Mu
lti-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19859","citing_title":"Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01201","citing_title":"Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17152","citing_title":"Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19859","citing_title":"Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15300","citing_title":"Deep Pre-Alignment for VLMs","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2311.04257","citing_title":"mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2311.17005","citing_title":"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2303.08128","citing_title":"ViperGPT: Visual Inference via Python Execution for Reasoning","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2305.18565","citing_title":"PaLI-X: On Scaling up a Multilingual Vision and Language Model","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2311.07575","citing_title":"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2309.17421","citing_title":"The Dawn of LMMs: Preliminary Explorations with 
GPT-4V(ision)","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2307.06942","citing_title":"InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2303.16199","citing_title":"LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2308.01390","citing_title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2303.11381","citing_title":"MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2303.17580","citing_title":"HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2305.06355","citing_title":"VideoChat: Chat-Centric Video Understanding","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2403.09631","citing_title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":128,"is_internal_anchor":true},{"citing_arxiv_id":"2306.14824","citing_title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2307.08621","citing_title":"Retentive Network: A Successor to Transformer for Large Language Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2402.06196","citing_title":"Large Language Models: A Survey","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2309.07864","citing_title":"The Rise and Potential of Large Language Model Based Agents: A 
Survey","ref_index":287,"is_internal_anchor":true},{"citing_arxiv_id":"2304.08485","citing_title":"Visual Instruction Tuning","ref_index":20,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5","json":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5.json","graph_json":"https://pith.science/api/pith-number/ZOHLDRDHWC72JFJ6655JF6NPE5/graph.json","events_json":"https://pith.science/api/pith-number/ZOHLDRDHWC72JFJ6655JF6NPE5/events.json","paper":"https://pith.science/paper/ZOHLDRDH"},"agent_actions":{"view_html":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5","download_json":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5.json","view_paper":"https://pith.science/paper/ZOHLDRDH","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2302.14045&json=true","fetch_graph":"https://pith.science/api/pith-number/ZOHLDRDHWC72JFJ6655JF6NPE5/graph.json","fetch_events":"https://pith.science/api/pith-number/ZOHLDRDHWC72JFJ6655JF6NPE5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5/action/storage_attestation","attest_author":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5/action/author_attestation","sign_citation":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5/action/citation_signature","submit_replication":"https://pith.science/pith/ZOHLDRDHWC72JFJ6655JF6NPE5/action/replication_record"}},"created_at":"2026-05-17T23:38:50.584105+00:00","updated_at":"2026-05-17T23:38:50.584105+00:00"}