{"paper":{"title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bin Wang, Conghui He, Dahua Lin, Hang Yan, Haodong Duan, Jiaqi Wang, Jifeng Dai, Jingwen Li, Kai Chen, Lin Chen, Linke Ouyang, Pan Zhang, Peng Sun, Qipeng Guo, Rui Qian, Songyang Zhang, Wei Li, Wenhai Wang, Wenwei Zhang, Xiaoyi Dong, Xingcheng Zhang, Xinyue Zhang, Yang Gao, Yining Li, Yuhang Cao, Yuhang Zang, Yu Qiao","submitted_at":"2024-07-03T17:59:21Z","abstract_excerpt":"We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend... outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 28 chosen benchmarks and the specific 16 key tasks are representative of real-world use and that RoPE extrapolation from 24K training to 96K inference does not introduce hidden degradation on long outputs.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5232fbb4d688d62e55a0d633f8ed131d696cd383dd58da4551894e95b7de4068"},"source":{"id":"2407.03320","kind":"arxiv","version":1},"verdict":{"id":"2753d278-4a63-48a0-9867-b56f079b2334","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T10:41:59.051328Z","strongest_claim":"IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend... outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 28 chosen benchmarks and the specific 16 key tasks are representative of real-world use and that RoPE extrapolation from 24K training to 96K inference does not introduce hidden degradation on long outputs.","pith_extraction_headline":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support."},"references":{"count":183,"sample":[{"doi":"","year":2019,"title":"Nocaps: Novel object captioning at scale","work_id":"32bf5252-aa77-411b-b9df-5ea51e0be909","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Flamingo: a visual language model for few-shot learning,","work_id":"15887c25-c51f-4381-9fe0-7afe4a3002b7","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Claude 3 haiku: our fastest model yet,","work_id":"2b2c4fa3-164a-40fa-a5c0-0ac5851720e6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Available at: https://www.anthropic.com/ news/claude-3-haiku. 1, 8","work_id":"4d627030-9afe-43c8-aa8b-0c4fd374025f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Lawrence Zitnick, and Devi Parikh","work_id":"200fccfb-5483-49fd-8257-589662fe71a7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":183,"snapshot_sha256":"c83da670fbcdeef0ccca461f216f28e62443b28f996363fbc962f28c6dae5108","internal_anchors":36},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c67a7876b4ffa359df5df1c57b5e84bb7c6c9c36e4846670e862892b3cc1beb6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}