{"work":{"id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","openalex_id":null,"doi":null,"arxiv_id":"2305.06500","raw_key":null,"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","authors":null,"authors_text":"Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang","year":2023,"venue":"cs.CV","abstract":"Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. 
All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.","external_url":"https://arxiv.org/abs/2305.06500","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T08:14:10.177899+00:00","pith_arxiv_id":"2305.06500","created_at":"2026-05-09T23:54:45.641735+00:00","updated_at":"2026-05-24T08:14:10.177899+00:00","title_quality_ok":true,"display_title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","render_title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning"},"hub":{"state":{"work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":76,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2023-03-28T17:59:12+00:00","last_pith_cited_at":"2026-05-17T09:21:11+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-02T11:04:35.297919+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":15},{"context_role":"baseline","n":5},{"context_role":"method","n":5},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"baseline","n":5},{"context_polarity":"use_method","n":5},{"context_polarity":"support","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-17T05:59:42.267731+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language 
Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":28},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":22},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":22},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":22},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":21},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":21},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":17},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"mplug-owl: Modularization empowers large language models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":17},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":15},{"title":"MMBench: Is Your Multi-modal Model an All-around Player?","work_id":"3b44943d-0f15-4228-9ac3-0e376f4f9ada","shared_citers":15},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":13},{"title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":12},{"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","work_id":"0fe2cfd8-d442-4ceb-b1a9-a465704f39b2","shared_citers":11},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue 
Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":11},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":10},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":10},{"title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","work_id":"7f3bac41-a0a5-4a7a-bfd2-526b616db745","shared_citers":9},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":9},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":8},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":8},{"title":"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning","work_id":"ba8e8164-e47f-42d6-83ad-696cb57ee79a","shared_citers":8},{"title":"Multimodal-gpt: A vision and language model for dialogue with humans","work_id":"e5fb1f2e-4ed2-454f-87a3-9e9c40f8fa31","shared_citers":8},{"title":"PandaGPT: One Model To Instruction-Follow Them All","work_id":"b3689b4d-65c2-45ae-84da-01b291742486","shared_citers":8}],"time_series":[{"n":20,"year":2023},{"n":9,"year":2024},{"n":4,"year":2025},{"n":14,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","primary_cat":"cs.CV","context_text":"Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). 
By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual en-","citing_arxiv_id":"2604.03231"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","context_text":"Hal-EvalIn-domainEvent. F1↑ Hal-EvalOut-of-domainEvent. F1↑ mPLUG-Owl [184] 7B 120.00 50.00 50.00 55.00 68.06 66.79 66.82 69.3 31.2 54.1 47 46.6ImageBind-LLM [54] 7B 128.33 60.00 46.67 73.33 - - - - - - - -InstructBLIP [36] (7B) 7B - - - - - - - 89.1 82.6 86.2 66.2 66.6InstructBLIP [36] (13B) 13B 185.00 143.33 66.67 153.33 89.29 83.45 78.45 84.7 - - - -VisualGLM-6B [39] 6B 85.00 50.00 48.33 55.00 - - - - - - - -Multimodal-GPT [47] 7B 61.67 55.00 58.33 68.33 66.68 66.67 66.67 - - - - -PandaGPT [147] 7B 70.00 50.00 50.00 50.00 - - - - - - - -LaVIN [118] 13B 185.00 88.33 63.33 75.00 - - - - - - - -Cheetor [99] 7B 180.00 96.67 80.00 116.67 - - - - - - - -GPT-4V [125] - 190.00 160.00 95.00 150.00 - - - - 89.6 92.7 - -LLaVA [111] (7B) 7B - - - - - - - 73.","citing_arxiv_id":"2404.18930"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Scaling Rectified Flow Transformers for High-Resolution Image Synthesis","primary_cat":"cs.CV","context_text":"Details on Simulation-Free Training of Flows Following (Lipman et al., 2023), to see that ut(z) generates pt, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): d dt 
pt(x) + ∇ · [pt(x)vt(x)] = 0 ↔ vt generates probability density path pt. (26) Therefore it suffices to show that −∇ · [ut(z)pt(z)] = −∇ · [Eϵ∼N(0,I)ut(z|ϵ) pt(z|ϵ) pt(z) pt(z)] (27) = Eϵ∼N(0,I) − ∇ · [ut(z|ϵ)pt(z|ϵ)] (28) = Eϵ∼N(0,I) d dt pt(z|ϵ) = d dt pt(z), (29) where we used the continuity equation Equation (26) for ut(z|ϵ) in line Equation (28) to Equation (29) since ut(z|ϵ) generates pt(z|ϵ) and the definition of Equation (6) in line Equation (27) The equivalence of objectives LF M ⇋ LCF M (Lipman et al., 2023) follows from","citing_arxiv_id":"2403.03206"},{"n":1,"role":"method","polarity":"use_method","paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","context_text":"variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 in- put resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. \"zs IN\" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [38]. The final selected model is marked in gray . visual glue LLM dataset dialogue caption visual question answering encoder layer MME NoCaps OKVQA VizWizval GQA EV A-E MLP V-7B 665K [91] 970.5 75.1 40.1 25.5 41.3 IViT-6B MLP V-7B 665K [91] 1022.","citing_arxiv_id":"2312.14238"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension","primary_cat":"cs.CL","context_text":"averaged performance on three dimensions for evaluating temporal understanding. 
Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaV A [8] LLaMA-7B 36.96 8 23.75 16 33.52 9 MiniGPT-4 [7] Flan-T5-XL 47.40 4 29.89 7 42.84 4 VPGTrans [40] LLaMA-7B 41.81 5 31.40 5 39.10 5 MultiModal-GPT [12] LLaMA-7B 34.54 12 29.21 10 33.15 11 Otter [11] LLaMA-7B 35.16 11 30.35 6 33.91 8 OpenFlamingo [41] LLaMA-7B 34.","citing_arxiv_id":"2307.16125"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","primary_cat":"cs.CV","context_text":"It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octopus [3], Muffin [56], Otter [23], LRV-Instruction [29], Chee- tor [24], LLaMA-Adapter-v2 [15], GIT2 [45], BLIV A [19], Lynx [57], MMICL [61], GPT-4V [39], Skywork-MM [4], mPLUG-Owl2 [ 52], Qwen-VL-Chat [ 9], XComposer-VL [ 7], LLaV A [30],","citing_arxiv_id":"2306.13394"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","context_text":"Currently, ChatGPT can be powered by either GPT-3.5 or GPT-4 models. Overall, one select the suitable model interface based on the specific application scenarios and response requirements. 
The detailed usage can be found on their project websites 20. TABLE 2: Statistics of commonly-used data sources. Corpora Size Source Latest Update Time BookCorpus [158] 5GB Books Dec-2015 Gutenberg [159] - Books Dec-2021 C4 [82] 800GB CommonCrawl Apr-2019 CC-Stories-R [160] 31GB CommonCrawl Sep-2019 CC-NEWS [27] 78GB CommonCrawl Feb-2019 REALNEWs [161] 120GB CommonCrawl Apr-2019 OpenWebText [162] 38GB Reddit links Mar-2023 Pushift.io [163] 2TB Reddit links Mar-2023 Wikipedia [164] 21GB Wikipedia Mar-2023 BigQuery [165] - Codes Mar-2023","citing_arxiv_id":"2303.18223"}]},"error":null,"updated_at":"2026-05-17T05:59:34.632000+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-17T05:59:38.439547+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","claims":[{"claim_text":"Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. 
We gather 26 publicly available ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"averaged performance on three dimensions for evaluating temporal understanding. Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaV A [8] LLaMA-7B 36","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":", Markdown) [29]. However, this modality trans- formation is not only limited by the recognition ability of external tools, but also destroys the inherent 2D physical topological structure and spatial alignment of complex tables, especially those with hierarchical headers [41,50]. Recently, with the rapid development of Multimodal Large Language Models (MLLMs) [1,3,15], the research community has begun to explore unified and end-to-end methods for image-based table reasoning, which aims to prese","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. 
The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octop","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 in- put resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. \"zs IN\" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [3","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Details on Simulation-Free Training of Flows Following (Lipman et al., 2023), to see that ut(z) generates pt, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): d dt pt(x) + ∇ · [pt(x)vt(x)] = 0 ↔ vt generates probability density path pt. 
(26) Therefore it suffices to show that −∇ · [ut(z)pt(z)] = −∇ · [Eϵ∼N(0,I)ut(z|ϵ) pt(z|ϵ) pt(z) pt(z)] (27) = Eϵ∼N(0,I) − ∇ · [ut(z|ϵ)pt(z|ϵ)] (28) = Eϵ∼N(0,I) d dt pt(z|ϵ) = d dt pt(z), (29) where we used the c","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (6 contexts).","role_counts":[{"n":6,"context_role":"background"},{"n":4,"context_role":"baseline"},{"n":2,"context_role":"method"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-17T05:59:38.444324+00:00"}},"summary":{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","claims":[{"claim_text":"Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"averaged performance on three dimensions for evaluating temporal understanding. 
Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaV A [8] LLaMA-7B 36","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":", Markdown) [29]. However, this modality trans- formation is not only limited by the recognition ability of external tools, but also destroys the inherent 2D physical topological structure and spatial alignment of complex tables, especially those with hierarchical headers [41,50]. Recently, with the rapid development of Multimodal Large Language Models (MLLMs) [1,3,15], the research community has begun to explore unified and end-to-end methods for image-based table reasoning, which aims to prese","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octop","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). 
By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 in- put resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. \"zs IN\" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [3","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Details on Simulation-Free Training of Flows Following (Lipman et al., 2023), to see that ut(z) generates pt, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): d dt pt(x) + ∇ · [pt(x)vt(x)] = 0 ↔ vt generates probability density path pt. (26) Therefore it suffices to show that −∇ · [ut(z)pt(z)] = −∇ · [Eϵ∼N(0,I)ut(z|ϵ) pt(z|ϵ) pt(z) pt(z)] (27) = Eϵ∼N(0,I) − ∇ · [ut(z|ϵ)pt(z|ϵ)] (28) = Eϵ∼N(0,I) d dt pt(z|ϵ) = d dt pt(z), (29) where we used the c","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (6 contexts).","role_counts":[{"n":6,"context_role":"background"},{"n":4,"context_role":"baseline"},{"n":2,"context_role":"method"},{"n":1,"context_role":"dataset"}]},"graph":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":28},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":22},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":22},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":22},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":21},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":21},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":17},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"mplug-owl: Modularization empowers large language models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":17},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":15},{"title":"MMBench: Is Your Multi-modal Model an All-around Player?","work_id":"3b44943d-0f15-4228-9ac3-0e376f4f9ada","shared_citers":15},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat 
Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":13},{"title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":12},{"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","work_id":"0fe2cfd8-d442-4ceb-b1a9-a465704f39b2","shared_citers":11},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":11},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":10},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":10},{"title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","work_id":"7f3bac41-a0a5-4a7a-bfd2-526b616db745","shared_citers":9},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":9},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":8},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":8},{"title":"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning","work_id":"ba8e8164-e47f-42d6-83ad-696cb57ee79a","shared_citers":8},{"title":"Multimodal-gpt: A vision and language model for dialogue with humans","work_id":"e5fb1f2e-4ed2-454f-87a3-9e9c40f8fa31","shared_citers":8},{"title":"PandaGPT: One Model To Instruction-Follow Them All","work_id":"b3689b4d-65c2-45ae-84da-01b291742486","shared_citers":8}],"time_series":[{"n":20,"year":2023},{"n":9,"year":2024},{"n":4,"year":2025},{"n":14,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"CoME-VL: Scaling 
Complementary Multi-Encoder Vision-Language Learning","primary_cat":"cs.CV","context_text":"Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual en-","citing_arxiv_id":"2604.03231"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","context_text":"Hal-EvalIn-domainEvent. F1↑ Hal-EvalOut-of-domainEvent. 
F1↑ mPLUG-Owl [184] 7B 120.00 50.00 50.00 55.00 68.06 66.79 66.82 69.3 31.2 54.1 47 46.6ImageBind-LLM [54] 7B 128.33 60.00 46.67 73.33 - - - - - - - -InstructBLIP [36] (7B) 7B - - - - - - - 89.1 82.6 86.2 66.2 66.6InstructBLIP [36] (13B) 13B 185.00 143.33 66.67 153.33 89.29 83.45 78.45 84.7 - - - -VisualGLM-6B [39] 6B 85.00 50.00 48.33 55.00 - - - - - - - -Multimodal-GPT [47] 7B 61.67 55.00 58.33 68.33 66.68 66.67 66.67 - - - - -PandaGPT [147] 7B 70.00 50.00 50.00 50.00 - - - - - - - -LaVIN [118] 13B 185.00 88.33 63.33 75.00 - - - - - - - -Cheetor [99] 7B 180.00 96.67 80.00 116.67 - - - - - - - -GPT-4V [125] - 190.00 160.00 95.00 150.00 - - - - 89.6 92.7 - -LLaVA [111] (7B) 7B - - - - - - - 73.","citing_arxiv_id":"2404.18930"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Scaling Rectified Flow Transformers for High-Resolution Image Synthesis","primary_cat":"cs.CV","context_text":"Details on Simulation-Free Training of Flows Following (Lipman et al., 2023), to see that ut(z) generates pt, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): d dt pt(x) + ∇ · [pt(x)vt(x)] = 0 ↔ vt generates probability density path pt. 
(26) Therefore it suffices to show that −∇ · [ut(z)pt(z)] = −∇ · [Eϵ∼N(0,I)ut(z|ϵ) pt(z|ϵ) pt(z) pt(z)] (27) = Eϵ∼N(0,I) − ∇ · [ut(z|ϵ)pt(z|ϵ)] (28) = Eϵ∼N(0,I) d dt pt(z|ϵ) = d dt pt(z), (29) where we used the continuity equation Equation (26) for ut(z|ϵ) in line Equation (28) to Equation (29) since ut(z|ϵ) generates pt(z|ϵ) and the definition of Equation (6) in line Equation (27) The equivalence of objectives LF M ⇋ LCF M (Lipman et al., 2023) follows from","citing_arxiv_id":"2403.03206"},{"n":1,"role":"method","polarity":"use_method","paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","context_text":"variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 in- put resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. \"zs IN\" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [38]. The final selected model is marked in gray . visual glue LLM dataset dialogue caption visual question answering encoder layer MME NoCaps OKVQA VizWizval GQA EV A-E MLP V-7B 665K [91] 970.5 75.1 40.1 25.5 41.3 IViT-6B MLP V-7B 665K [91] 1022.","citing_arxiv_id":"2312.14238"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension","primary_cat":"cs.CL","context_text":"averaged performance on three dimensions for evaluating temporal understanding. 
Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaV A [8] LLaMA-7B 36.96 8 23.75 16 33.52 9 MiniGPT-4 [7] Flan-T5-XL 47.40 4 29.89 7 42.84 4 VPGTrans [40] LLaMA-7B 41.81 5 31.40 5 39.10 5 MultiModal-GPT [12] LLaMA-7B 34.54 12 29.21 10 33.15 11 Otter [11] LLaMA-7B 35.16 11 30.35 6 33.91 8 OpenFlamingo [41] LLaMA-7B 34.","citing_arxiv_id":"2307.16125"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","primary_cat":"cs.CV","context_text":"It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octopus [3], Muffin [56], Otter [23], LRV-Instruction [29], Chee- tor [24], LLaMA-Adapter-v2 [15], GIT2 [45], BLIV A [19], Lynx [57], MMICL [61], GPT-4V [39], Skywork-MM [4], mPLUG-Owl2 [ 52], Qwen-VL-Chat [ 9], XComposer-VL [ 7], LLaV A [30],","citing_arxiv_id":"2306.13394"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","context_text":"Currently, ChatGPT can be powered by either GPT-3.5 or GPT-4 models. Overall, one select the suitable model interface based on the specific application scenarios and response requirements. 
The detailed usage can be found on their project websites 20. TABLE 2: Statistics of commonly-used data sources. Corpora Size Source Latest Update Time BookCorpus [158] 5GB Books Dec-2015 Gutenberg [159] - Books Dec-2021 C4 [82] 800GB CommonCrawl Apr-2019 CC-Stories-R [160] 31GB CommonCrawl Sep-2019 CC-NEWS [27] 78GB CommonCrawl Feb-2019 REALNEWs [161] 120GB CommonCrawl Apr-2019 OpenWebText [162] 38GB Reddit links Mar-2023 Pushift.io [163] 2TB Reddit links Mar-2023 Wikipedia [164] 21GB Wikipedia Mar-2023 BigQuery [165] - Codes Mar-2023","citing_arxiv_id":"2303.18223"}]},"authors":[]}}