{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:T3R2WYDILTDHFB22LT4M6D44D3","short_pith_number":"pith:T3R2WYDI","schema_version":"1.0","canonical_sha256":"9ee3ab60685cc672875a5cf8cf0f9c1ec15b3f02177cf550807d3b7ab251300e","source":{"kind":"arxiv","id":"2407.03320","version":1},"attestation_state":"computed","paper":{"title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bin Wang, Conghui He, Dahua Lin, Hang Yan, Haodong Duan, Jiaqi Wang, Jifeng Dai, Jingwen Li, Kai Chen, Lin Chen, Linke Ouyang, Pan Zhang, Peng Sun, Qipeng Guo, Rui Qian, Songyang Zhang, Wei Li, Wenhai Wang, Wenwei Zhang, Xiaoyi Dong, Xingcheng Zhang, Xinyue Zhang, Yang Gao, Yining Li, Yuhang Cao, Yuhang Zang, Yu Qiao","submitted_at":"2024-07-03T17:59:21Z","abstract_excerpt":"We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.03320","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"38e695c3ae3d470f400cb2e8ab0933bd36b3e26713f77856af17cbb4736facd1","abstract_canon_sha256":"21cceb9d462163087b0dca8e7bb289e0afc7fcd632313d0b62ce244763f889b9"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.327933Z","signature_b64":"yqM7rTwCM0OrKUGFNMqb+zjKYW+/7QyJKfMs5wtGfS1V3dl6TINAZeK5ucBbzCLAWavGSmWEC+kOZrBA6svFDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9ee3ab60685cc672875a5cf8cf0f9c1ec15b3f02177cf550807d3b7ab251300e","last_reissued_at":"2026-05-17T23:38:14.327329Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.327329Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bin Wang, Conghui He, Dahua Lin, Hang Yan, Haodong Duan, Jiaqi Wang, Jifeng Dai, Jingwen Li, Kai Chen, Lin Chen, Linke Ouyang, Pan Zhang, Peng Sun, Qipeng Guo, Rui Qian, Songyang Zhang, Wei Li, Wenhai Wang, Wenwei Zhang, Xiaoyi Dong, Xingcheng Zhang, Xinyue Zhang, Yang Gao, Yining Li, Yuhang Cao, Yuhang Zang, Yu Qiao","submitted_at":"2024-07-03T17:59:21Z","abstract_excerpt":"We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend... outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 28 chosen benchmarks and the specific 16 key tasks are representative of real-world use and that RoPE extrapolation from 24K training to 96K inference does not introduce hidden degradation on long outputs.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5232fbb4d688d62e55a0d633f8ed131d696cd383dd58da4551894e95b7de4068"},"source":{"id":"2407.03320","kind":"arxiv","version":1},"verdict":{"id":"2753d278-4a63-48a0-9867-b56f079b2334","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T10:41:59.051328Z","strongest_claim":"IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend... outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 28 chosen benchmarks and the specific 16 key tasks are representative of real-world use and that RoPE extrapolation from 24K training to 96K inference does not introduce hidden degradation on long outputs.","pith_extraction_headline":"InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support."},"references":{"count":183,"sample":[{"doi":"","year":2019,"title":"Nocaps: Novel object captioning at scale","work_id":"32bf5252-aa77-411b-b9df-5ea51e0be909","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Flamingo: a visual language model for few-shot learning,","work_id":"15887c25-c51f-4381-9fe0-7afe4a3002b7","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Claude 3 haiku: our fastest model yet,","work_id":"2b2c4fa3-164a-40fa-a5c0-0ac5851720e6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Available at: https://www.anthropic.com/ news/claude-3-haiku. 1, 8","work_id":"4d627030-9afe-43c8-aa8b-0c4fd374025f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Lawrence Zitnick, and Devi Parikh","work_id":"200fccfb-5483-49fd-8257-589662fe71a7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":183,"snapshot_sha256":"c83da670fbcdeef0ccca461f216f28e62443b28f996363fbc962f28c6dae5108","internal_anchors":36},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c67a7876b4ffa359df5df1c57b5e84bb7c6c9c36e4846670e862892b3cc1beb6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.03320","created_at":"2026-05-17T23:38:14.327452+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.03320v1","created_at":"2026-05-17T23:38:14.327452+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.03320","created_at":"2026-05-17T23:38:14.327452+00:00"},{"alias_kind":"pith_short_12","alias_value":"T3R2WYDILTDH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"T3R2WYDILTDHFB22","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"T3R2WYDI","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2508.10016","citing_title":"Training-Free Multimodal Large Language Model Orchestration","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22078","citing_title":"Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2408.04840","citing_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","ref_index":267,"is_internal_anchor":true},{"citing_arxiv_id":"2508.10016","citing_title":"Training-Free Multimodal Large Language Model Orchestration","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2502.04326","citing_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2410.17247","citing_title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2409.17146","citing_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","ref_index":133,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13803","citing_title":"EvoGround: Self-Evolving Video Agents for Video Temporal Grounding","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2409.02813","citing_title":"MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2503.01785","citing_title":"Visual-RFT: Visual Reinforcement Fine-Tuning","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2506.15564","citing_title":"Show-o2: Improved Native Unified Multimodal Models","ref_index":142,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08896","citing_title":"GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07562","citing_title":"Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2410.02713","citing_title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","ref_index":153,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04500","citing_title":"Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward","ref_index":90,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13804","citing_title":"Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2408.03326","citing_title":"LLaVA-OneVision: Easy Visual Task Transfer","ref_index":162,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13565","citing_title":"UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17052","citing_title":"OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18512","citing_title":"S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models","ref_index":131,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3","json":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3.json","graph_json":"https://pith.science/api/pith-number/T3R2WYDILTDHFB22LT4M6D44D3/graph.json","events_json":"https://pith.science/api/pith-number/T3R2WYDILTDHFB22LT4M6D44D3/events.json","paper":"https://pith.science/paper/T3R2WYDI"},"agent_actions":{"view_html":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3","download_json":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3.json","view_paper":"https://pith.science/paper/T3R2WYDI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.03320&json=true","fetch_graph":"https://pith.science/api/pith-number/T3R2WYDILTDHFB22LT4M6D44D3/graph.json","fetch_events":"https://pith.science/api/pith-number/T3R2WYDILTDHFB22LT4M6D44D3/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3/action/timestamp_anchor","attest_storage":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3/action/storage_attestation","attest_author":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3/action/author_attestation","sign_citation":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3/action/citation_signature","submit_replication":"https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3/action/replication_record"}},"created_at":"2026-05-17T23:38:14.327452+00:00","updated_at":"2026-05-17T23:38:14.327452+00:00"}