{"work":{"id":"e2121c51-a55e-476a-af81-7ba6970fe6cf","openalex_id":null,"doi":null,"arxiv_id":"2311.10122","raw_key":null,"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","authors":null,"authors_text":"Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin","year":2023,"venue":"cs.CV","abstract":"The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: \\href{https://github.com/PKU-YuanGroup/Video-LLaVA}","external_url":"https://arxiv.org/abs/2311.10122","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:55:33.247789+00:00","pith_arxiv_id":"2311.10122","created_at":"2026-05-09T22:34:07.310959+00:00","updated_at":"2026-05-25T07:55:33.247789+00:00","title_quality_ok":true,"display_title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","render_title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection"},"hub":{"state":{"work_id":"e2121c51-a55e-476a-af81-7ba6970fe6cf","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":68,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2023-12-19T02:39:27+00:00","last_pith_cited_at":"2026-05-21T10:13:03+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-03T16:36:13.334186+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":7},{"context_role":"baseline","n":3},{"context_role":"method","n":2},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":8},{"context_polarity":"baseline","n":3},{"context_polarity":"unclear","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-17T05:39:33.402869+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":20},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":19},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":19},{"title":"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding","work_id":"555cf04a-49a7-44b8-9019-a83ce85ace95","shared_citers":16},{"title":"VideoChat: Chat-Centric Video Understanding","work_id":"07461eec-156c-4054-a28e-b84bc53bf6e1","shared_citers":15},{"title":"Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models","work_id":"51f627f4-8fae-4882-a3e9-abdf932ef27b","shared_citers":15},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":14},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":14},{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","shared_citers":11},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":10},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":9},{"title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","work_id":"8949d4db-20f2-47c1-83a6-fcbe041b62ef","shared_citers":9},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":8},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":8},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":8},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":8},{"title":"Long Context Transfer from Language to Vision","work_id":"52f1b946-568f-4819-9d8a-a87296f8852d","shared_citers":8},{"title":"MLVU: Benchmarking Multi-task Long Video Understanding","work_id":"346256da-dd21-4cc3-9a98-519467614854","shared_citers":8},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":8},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":8},{"title":"The Kinetics Human Action Video Dataset","work_id":"c8a3de61-cfd3-4aeb-bcf7-a0372c015748","shared_citers":8}],"time_series":[{"n":12,"year":2024},{"n":11,"year":2025},{"n":24,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","primary_cat":"cs.CV","context_text":"\"×\" indicates the model is incapable of performing the task. Model Understanding Image Generation Image Editing MMBV MMBI MMMU MM-Vet GenEval WISE Overall Add Adjust Extract Replace Remove Hybird Image Understanding LLaV A-1.5 [25] × 36.4 67.8 36.3 × × × × × × × × × LLaV A-NeXT [57] × 79.3 51.1 57.4 × × × × × × × × × Image & Video Understanding Video-LLaV A [22] 1.05 60.9 32.8 32.0 × × × × × × × × × LLaV A-OV [17] 0.94 80.8 48.8 57.5 × × × × × × × × × Text-to-Image Generation SDXL [34] × × × × 0.55 0.55 × × × × × × × FLUX.1 Dev [16] × × × × 0.66 0.50 × × × × × × × Image Editing MagicBrush [56] × × × × × × 1.83 2.84 1.58 1.51 1.97 1.58 1.62 Instruct-P2P [3] × × × × × × 1.88 2.45 1.83 1.44 2.01 1.50 1.20 AnyEdit [49] × × × × × × 2.","citing_arxiv_id":"2506.03147"},{"n":1,"role":"method","polarity":"use_method","paper_title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","primary_cat":"cs.CV","context_text":"backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million high-quality image-editing pairs spanning 13 editing cat- egories, including 110k multi-turn examples. Compared with existing datasets [ 79, 18, 86, 83, 28, 6], ImgEdit offers richer semantics, more detailed prompts, higher resolutions,","citing_arxiv_id":"2505.20275"},{"n":1,"role":"method","polarity":"background","paper_title":"LLaVA-OneVision: Easy Visual Task Transfer","primary_cat":"cs.CV","context_text":"[74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization, 2019. 40 [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39 [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2 [77] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on","citing_arxiv_id":"2408.03326"}]},"error":null,"updated_at":"2026-05-17T05:39:37.258575+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-17T05:39:37.207803+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","claims":[{"claim_text":"The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the found","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"\"×\" indicates the model is incapable of performing the task. Model Understanding Image Generation Image Editing MMBV MMBI MMMU MM-Vet GenEval WISE Overall Add Adjust Extract Replace Remove Hybird Image Understanding LLaV A-1.5 [25] × 36.4 67.8 36.3 × × × × × × × × × LLaV A-NeXT [57] × 79.3 51.1 57.4 × × × × × × × × × Image & Video Understanding Video-LLaV A [22] 1.05 60.9 32.8 32.0 × × × × × × × × × LLaV A-OV [17] 0.94 80.8 48.8 57.5 × × × × × × × × × Text-to-Image Generation SDXL [34] × × × × 0","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"SigLIP outperforms the other two vision encoders, especially in fine-grained understanding tasks involving texts. Based on this ablation study, we choose the pretrained SigLIP as our base vision encoder, and then adapt it to taking dynamic resolutions as inputs. 5 Related Work Multimodal LLMs for Native Video Understanding. Early video MLLMs primarily relied on sparsely sampled frames and simple connectors, such as MLPs [12, 13, 139], discrete visual tokenizers [140], and Q-formers [141, 142], t","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"SpatialLadder: Progressive training for spatial reasoning in vision-language models.arXiv, abs/2510.08531, 2025. [31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InInternational Conference on Machine Learning, volume 202, pages 19730-19742, 2023. [32] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaV A: Learning united visual representation by alignment","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"[74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization, 2019. 40 [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39 [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"language models. InECCV, 2024. 3 [37] Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. InICLR 2025 Workshop on Foundation Models in the Wild. 5 [38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alig","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Video-LLaVA: Learning United Visual Representation by Alignment Before Projection because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (3 contexts).","role_counts":[{"n":3,"context_role":"background"},{"n":2,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-17T05:39:37.211901+00:00"}},"summary":{"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","claims":[{"claim_text":"The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the found","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"\"×\" indicates the model is incapable of performing the task. Model Understanding Image Generation Image Editing MMBV MMBI MMMU MM-Vet GenEval WISE Overall Add Adjust Extract Replace Remove Hybird Image Understanding LLaV A-1.5 [25] × 36.4 67.8 36.3 × × × × × × × × × LLaV A-NeXT [57] × 79.3 51.1 57.4 × × × × × × × × × Image & Video Understanding Video-LLaV A [22] 1.05 60.9 32.8 32.0 × × × × × × × × × LLaV A-OV [17] 0.94 80.8 48.8 57.5 × × × × × × × × × Text-to-Image Generation SDXL [34] × × × × 0","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"SigLIP outperforms the other two vision encoders, especially in fine-grained understanding tasks involving texts. Based on this ablation study, we choose the pretrained SigLIP as our base vision encoder, and then adapt it to taking dynamic resolutions as inputs. 5 Related Work Multimodal LLMs for Native Video Understanding. Early video MLLMs primarily relied on sparsely sampled frames and simple connectors, such as MLPs [12, 13, 139], discrete visual tokenizers [140], and Q-formers [141, 142], t","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"SpatialLadder: Progressive training for spatial reasoning in vision-language models.arXiv, abs/2510.08531, 2025. [31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InInternational Conference on Machine Learning, volume 202, pages 19730-19742, 2023. [32] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaV A: Learning united visual representation by alignment","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"[74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization, 2019. 40 [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39 [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"language models. InECCV, 2024. 3 [37] Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. InICLR 2025 Workshop on Foundation Models in the Wild. 5 [38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alig","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Video-LLaVA: Learning United Visual Representation by Alignment Before Projection because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (3 contexts).","role_counts":[{"n":3,"context_role":"background"},{"n":2,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"graph":{"co_cited":[{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":20},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":19},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":19},{"title":"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding","work_id":"555cf04a-49a7-44b8-9019-a83ce85ace95","shared_citers":16},{"title":"VideoChat: Chat-Centric Video Understanding","work_id":"07461eec-156c-4054-a28e-b84bc53bf6e1","shared_citers":15},{"title":"Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models","work_id":"51f627f4-8fae-4882-a3e9-abdf932ef27b","shared_citers":15},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":14},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":14},{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","shared_citers":11},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":10},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":9},{"title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","work_id":"8949d4db-20f2-47c1-83a6-fcbe041b62ef","shared_citers":9},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":8},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":8},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":8},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":8},{"title":"Long Context Transfer from Language to Vision","work_id":"52f1b946-568f-4819-9d8a-a87296f8852d","shared_citers":8},{"title":"MLVU: Benchmarking Multi-task Long Video Understanding","work_id":"346256da-dd21-4cc3-9a98-519467614854","shared_citers":8},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":8},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":8},{"title":"The Kinetics Human Action Video Dataset","work_id":"c8a3de61-cfd3-4aeb-bcf7-a0372c015748","shared_citers":8}],"time_series":[{"n":12,"year":2024},{"n":11,"year":2025},{"n":24,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","primary_cat":"cs.CV","context_text":"\"×\" indicates the model is incapable of performing the task. Model Understanding Image Generation Image Editing MMBV MMBI MMMU MM-Vet GenEval WISE Overall Add Adjust Extract Replace Remove Hybird Image Understanding LLaV A-1.5 [25] × 36.4 67.8 36.3 × × × × × × × × × LLaV A-NeXT [57] × 79.3 51.1 57.4 × × × × × × × × × Image & Video Understanding Video-LLaV A [22] 1.05 60.9 32.8 32.0 × × × × × × × × × LLaV A-OV [17] 0.94 80.8 48.8 57.5 × × × × × × × × × Text-to-Image Generation SDXL [34] × × × × 0.55 0.55 × × × × × × × FLUX.1 Dev [16] × × × × 0.66 0.50 × × × × × × × Image Editing MagicBrush [56] × × × × × × 1.83 2.84 1.58 1.51 1.97 1.58 1.62 Instruct-P2P [3] × × × × × × 1.88 2.45 1.83 1.44 2.01 1.50 1.20 AnyEdit [49] × × × × × × 2.","citing_arxiv_id":"2506.03147"},{"n":1,"role":"method","polarity":"use_method","paper_title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","primary_cat":"cs.CV","context_text":"backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million high-quality image-editing pairs spanning 13 editing cat- egories, including 110k multi-turn examples. Compared with existing datasets [ 79, 18, 86, 83, 28, 6], ImgEdit offers richer semantics, more detailed prompts, higher resolutions,","citing_arxiv_id":"2505.20275"},{"n":1,"role":"method","polarity":"background","paper_title":"LLaVA-OneVision: Easy Visual Task Transfer","primary_cat":"cs.CV","context_text":"[74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization, 2019. 40 [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39 [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2 [77] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on","citing_arxiv_id":"2408.03326"}]},"authors":[]}}