{"paper":{"title":"CogVLM: Visual Expert for Pretrained Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Xu, Jiazheng Xu, Jie Tang, Ji Qi, Juanzi Li, Junhui Ji, Lei Zhao, Ming Ding, Qingsong Lv, Weihan Wang, Wenmeng Yu, Wenyi Hong, Xixuan Song, Yan Wang, Yuxiao Dong, Zhuoyi Yang","submitted_at":"2023-11-06T13:04:39Z","abstract_excerpt":"We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, Ref"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks... 
surpassing or matching PaLI-X 55B.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"be90eb575bcab3443ad5b5eeab07951b05affae30c186172073f5f508a425f23"},"source":{"id":"2311.03079","kind":"arxiv","version":2},"verdict":{"id":"abe2f049-9015-4958-aba1-fb6f3eaacd7b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T15:41:06.257046Z","strongest_claim":"CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks... 
surpassing or matching PaLI-X 55B.","one_line_summary":"CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.","pith_extraction_headline":"A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion."},"references":{"count":33,"sample":[{"doi":"","year":null,"title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","work_id":"87bfa84a-e663-4165-806f-93ef439d88d0","ref_index":1,"cited_arxiv_id":"2308.01390","is_internal_anchor":true},{"doi":"","year":null,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":2,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":1989,"title":"Murel: Multimodal relational reasoning for visual question answering","work_id":"49d68897-f597-43d7-b9ab-c6810ce5a8f3","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","ref_index":4,"cited_arxiv_id":"2306.15195","is_internal_anchor":true},{"doi":"","year":null,"title":"Universal captioner: Long-tail vision-and-language model training through content-style separation. arXiv preprint 
arXiv:2111.12727,","work_id":"8147134b-8245-480c-a294-d5382f4aa9aa","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":33,"snapshot_sha256":"930bafb70a094ebdbb2dfd3b02d5967ccde3eb04c48ea277d8612c0af9534ebb","internal_anchors":17},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b7d08736e454ae758a45db8623526339624dff559713aac366aeccd11d30943f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}