{"paper":{"title":"Language Is Not All You Need: Aligning Perception with Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Barun Patra, Furu Wei, Johan Bjorck, Kriti Aggarwal, Lei Cui, Li Dong, Owais Khan Mohammed, Qiang Liu, Saksham Singhal, Shaohan Huang, Shuming Ma, Subhojit Som, Tengchao Lv, Vishrav Chaudhary, Wenhui Wang, Xia Song, Yaru Hao, Zewen Chi","submitted_at":"2023-02-27T18:55:27Z","abstract_excerpt":"A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. 
We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Kosmos-1 achieves impressive performance on language understanding, generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and vision tasks such as image recognition with descriptions, all without gradient updates or finetuning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That web-scale multimodal corpora provide sufficient aligned signal for the model to acquire general cross-modal capabilities that transfer to held-out tasks without any task-specific adaptation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9742f48c92925033de4c965a4daa1bb558f625faf309c35c3116273eb5d73350"},"source":{"id":"2302.14045","kind":"arxiv","version":2},"verdict":{"id":"6456e510-b8bf-4a30-a815-3f51c57ef90d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T18:28:25.888757Z","strongest_claim":"Kosmos-1 achieves impressive performance on language understanding, generation, OCR-free NLP, multimodal dialogue, image captioning, visual question answering, and 
vision tasks such as image recognition with descriptions, all without gradient updates or finetuning.","one_line_summary":"Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That web-scale multimodal corpora provide sufficient aligned signal for the model to acquire general cross-modal capabilities that transfer to held-out tasks without any task-specific adaptation.","pith_extraction_headline":"Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning."},"references":{"count":33,"sample":[{"doi":"","year":null,"title":"Cm3: A causal masked multimodal model of the internet","work_id":"a4a6d3b6-13f5-437f-8081-765dd23198b9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Are Elephants Bigger than Butterflies? 
Reasoning about Sizes of Objects","work_id":"cda64168-85ba-41dd-8410-4f6f43755aa5","ref_index":2,"cited_arxiv_id":"1602.00753","is_internal_anchor":true},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"85340504-8b97-4876-b757-de68441ee4ff","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"BoolQ: Exploring the surprising difficulty of natural yes/no questions","work_id":"f9c11f6d-5c94-42d0-bfee-9f3a6d3a703b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","ref_index":5,"cited_arxiv_id":"2204.02311","is_internal_anchor":true}],"resolved_work":33,"snapshot_sha256":"f9d0d15096905a2cd5ff5193521c1b2423b8f85966338566bc90a16d6d37d567","internal_anchors":13},"formal_canon":{"evidence_count":3,"snapshot_sha256":"5e34ab46d29ada74852b360479b3bf59ccdd895a48bfe1e92e58c5d8f85c4713"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}