{"work":{"id":"68be622d-a6dc-4a13-82de-e3054a3dc509","openalex_id":null,"doi":null,"arxiv_id":"2304.08485","raw_key":null,"title":"Visual Instruction Tuning","authors":null,"authors_text":"Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee","year":2023,"venue":"cs.CV","abstract":"Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. 
We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.","external_url":"https://arxiv.org/abs/2304.08485","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:05:30.722847+00:00","pith_arxiv_id":"2304.08485","created_at":"2026-05-09T23:54:45.370240+00:00","updated_at":"2026-05-25T08:05:30.722847+00:00","title_quality_ok":false,"display_title":"Visual Instruction Tuning","render_title":"Visual Instruction Tuning"},"hub":{"state":{"work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":134,"external_cited_by_count":null,"distinct_field_count":11,"first_pith_cited_at":"2023-03-28T17:59:12+00:00","last_pith_cited_at":"2026-05-21T11:30:29+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T21:07:55.135843+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":39},{"context_role":"baseline","n":4},{"context_role":"method","n":4},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":38},{"context_polarity":"baseline","n":4},{"context_polarity":"use_method","n":4},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Visual Instruction Tuning","claims":[{"claim_text":"Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. 
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experime","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"advancement is Contrastive Language-Image Pre-training (CLIP) [13], initially published in 2021 , a training technique designed to solve large -scale image understanding. Alternative methods, such as ALIGN and BLIP, were also published during this time. These fundamental training techniques produce neural network structures integrated into LLVMs, like the Large Language-and-Vision Assistant (LLaVA) method [14]. 2.2 Contrastive Language-Image Pre-training CLIP takes two separate neural network ar","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"resentations and action generation, which motivates a task-guided graph to optimize the relational structure and provide transferable support for action prediction. 2 • We build a real-world robotic dataset to fine-tune our model, and extensive experiments show that the framework achieves state-of-the-art performance across diverse generalizable robotic tasks. 2 Related Work Vision-Language-Action Models.Vision-language models [ 15, 16, 17, 18, 19] have advanced scene reasoning and inspired VLA ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"performs real-time inference at the edge while offloading labeling and model assistance to the cloud [39]. Another line explores model decoupling-JALAD partitions DNN execution between edge and cloud to jointly optimize latency and transmission cost [22]. The second direction focuses on MLLM-enabled edge-cloud col- laborative inference. 
In recent years, MLLMs, such as BLIP-2 [23], LLaVa [25], GPT-5 [32], and Qwen2.5-VL [2], have greatly improved cross-modal understanding and structured generatio","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"active dialogue across tasks. These datasets have proven critical for training general-purpose multimodal assistants, as they improve zero-shot reasoning and align models with user intent. General-Domain Instruction Datasets.General purpose instruction datasetshavebeendevelopedtohelptrainmultimodalLLMstoworkacrossa 6 variety of tasks. For instance, Liu et al. [41] introduced LLaVA-Instruct , us- ing GPT-4 to generate 158K image-based instruction-response pairs for train- ing VLMs. This approach ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaV A [8] LLaMA-7B 36.96 8 23.75 16 33.52 9 MiniGPT-4 [7] Flan-T5-XL 47.40 4 29.89 7 42.84 4 VPGTrans","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"modeling have augmented LLMs with visual encoders, giving rise to large vision-language models (VLMs) capable of jointly reasoning over text and visual inputs [3]. These models have enabled a new class of deployment scenarios, including au- tonomous driving [4], [5], embodied agents [6], [7], interactive assistants [8], [9], and video question-answering [10], [11]. 
While early VLM studies primarily centered on static image-text pairs, many of these emerging applications operate over continuous v","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Visual Instruction Tuning because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (30 contexts).","role_counts":[{"n":30,"context_role":"background"},{"n":4,"context_role":"baseline"},{"n":3,"context_role":"method"}]},"error":null,"updated_at":"2026-05-19T00:21:24.971753+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"f6e528d4-1fed-4015-981f-3a38357ac07d","orcid":null,"display_name":"Haotian Liu"},{"id":"4c582d1e-c6c5-47f7-b526-e877960cf613","orcid":null,"display_name":"Chunyuan Li"},{"id":"3fa13fa5-8e0f-475f-a9ab-dfe6dd1e144b","orcid":null,"display_name":"Qingyang Wu"},{"id":"a02d550a-6956-46c1-9c39-875f68d99648","orcid":null,"display_name":"Yong Jae Lee"}]},"error":null,"updated_at":"2026-05-19T00:21:25.369492+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T10:59:32.420197+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":20},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":19},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":18},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":14},{"title":"Gemini: A Family 
of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":13},{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","shared_citers":11},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":11},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":9},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":9},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":9},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":9},{"title":"Language is not all you need: Aligning perception with language models","work_id":"2a1e0563-79f5-4521-8293-b8b1aebf7cee","shared_citers":8},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":8},{"title":"mplug-owl: Modularization empowers large language models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":8},{"title":"Otter: A multi-modal model with in-context instruction tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":8},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":8},{"title":"Scaling Instruction-Finetuned Language 
Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":8},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":7},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":7},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":7},{"title":"ShareGPT4V: Improving Large Multi-Modal Models with Better Captions","work_id":"90e2b26a-3d27-4567-86b5-929b582a8034","shared_citers":7},{"title":"A Survey on In-context Learning","work_id":"864701ca-cb36-4a91-9be8-e2b9b20679aa","shared_citers":6}],"time_series":[{"n":13,"year":2023},{"n":9,"year":2024},{"n":2,"year":2025},{"n":39,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:09:44.216433+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T10:59:34.724502+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Visual Instruction Tuning","claims":[{"claim_text":"Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. 
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experime","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"advancement is Contrastive Language-Image Pre-training (CLIP) [13], initially published in 2021 , a training technique designed to solve large -scale image understanding. Alternative methods, such as ALIGN and BLIP, were also published during this time. These fundamental training techniques produce neural network structures integrated into LLVMs, like the Large Language-and-Vision Assistant (LLaVA) method [14]. 2.2 Contrastive Language-Image Pre-training CLIP takes two separate neural network ar","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"resentations and action generation, which motivates a task-guided graph to optimize the relational structure and provide transferable support for action prediction. 2 • We build a real-world robotic dataset to fine-tune our model, and extensive experiments show that the framework achieves state-of-the-art performance across diverse generalizable robotic tasks. 2 Related Work Vision-Language-Action Models.Vision-language models [ 15, 16, 17, 18, 19] have advanced scene reasoning and inspired VLA ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"performs real-time inference at the edge while offloading labeling and model assistance to the cloud [39]. Another line explores model decoupling-JALAD partitions DNN execution between edge and cloud to jointly optimize latency and transmission cost [22]. The second direction focuses on MLLM-enabled edge-cloud col- laborative inference. 
In recent years, MLLMs, such as BLIP-2 [23], LLaVa [25], GPT-5 [32], and Qwen2.5-VL [2], have greatly improved cross-modal understanding and structured generatio","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"active dialogue across tasks. These datasets have proven critical for training general-purpose multimodal assistants, as they improve zero-shot reasoning and align models with user intent. General-Domain Instruction Datasets.General purpose instruction datasetshavebeendevelopedtohelptrainmultimodalLLMstoworkacrossa 6 variety of tasks. For instance, Liu et al. [41] introduced LLaVA-Instruct , us- ing GPT-4 to generate 158K image-based instruction-response pairs for train- ing VLMs. This approach ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaV A [8] LLaMA-7B 36.96 8 23.75 16 33.52 9 MiniGPT-4 [7] Flan-T5-XL 47.40 4 29.89 7 42.84 4 VPGTrans","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"modeling have augmented LLMs with visual encoders, giving rise to large vision-language models (VLMs) capable of jointly reasoning over text and visual inputs [3]. These models have enabled a new class of deployment scenarios, including au- tonomous driving [4], [5], embodied agents [6], [7], interactive assistants [8], [9], and video question-answering [10], [11]. 
While early VLM studies primarily centered on static image-text pairs, many of these emerging applications operate over continuous v","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Visual Instruction Tuning because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (30 contexts).","role_counts":[{"n":30,"context_role":"background"},{"n":4,"context_role":"baseline"},{"n":3,"context_role":"method"}]},"error":null,"updated_at":"2026-05-19T00:21:24.975709+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Visual Instruction Tuning","claims":[{"claim_text":"Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experime","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Visual Instruction Tuning because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:09:33.406083+00:00"}},"summary":{"title":"Visual Instruction Tuning","claims":[{"claim_text":"Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. 
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experime","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Visual Instruction Tuning because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":20},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":19},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":18},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":14},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":13},{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","shared_citers":11},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":11},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":9},{"title":"LoRA: Low-Rank Adaptation of Large Language 
Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":9},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":9},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":9},{"title":"Language is not all you need: Aligning perception with language models","work_id":"2a1e0563-79f5-4521-8293-b8b1aebf7cee","shared_citers":8},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":8},{"title":"mplug-owl: Modularization empowers large language models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":8},{"title":"Otter: A multi-modal model with in-context instruction tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":8},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":8},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":8},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":7},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":7},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":7},{"title":"ShareGPT4V: Improving Large Multi-Modal Models with Better Captions","work_id":"90e2b26a-3d27-4567-86b5-929b582a8034","shared_citers":7},{"title":"A Survey on In-context 
Learning","work_id":"864701ca-cb36-4a91-9be8-e2b9b20679aa","shared_citers":6}],"time_series":[{"n":13,"year":2023},{"n":9,"year":2024},{"n":2,"year":2025},{"n":39,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"4c582d1e-c6c5-47f7-b526-e877960cf613","orcid":null,"display_name":"Chunyuan Li","source":"manual","import_confidence":0.72},{"id":"f6e528d4-1fed-4015-981f-3a38357ac07d","orcid":null,"display_name":"Haotian Liu","source":"manual","import_confidence":0.72},{"id":"3fa13fa5-8e0f-475f-a9ab-dfe6dd1e144b","orcid":null,"display_name":"Qingyang Wu","source":"manual","import_confidence":0.72},{"id":"a02d550a-6956-46c1-9c39-875f68d99648","orcid":null,"display_name":"Yong Jae Lee","source":"manual","import_confidence":0.72}]}}