{"work":{"id":"0c16c250-fd0f-446a-bbb0-ea8dd0ba5ccd","openalex_id":null,"doi":null,"arxiv_id":"2303.15389","raw_key":null,"title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","authors":null,"authors_text":"Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao","year":2023,"venue":"cs.CV","abstract":"Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. 
To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.","external_url":"https://arxiv.org/abs/2303.15389","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T03:33:50.834294+00:00","pith_arxiv_id":"2303.15389","created_at":"2026-05-08T21:44:18.095586+00:00","updated_at":"2026-05-24T03:33:50.834294+00:00","title_quality_ok":true,"display_title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","render_title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale"},"hub":{"state":{"work_id":"0c16c250-fd0f-446a-bbb0-ea8dd0ba5ccd","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":74,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2023-03-27T15:53:01+00:00","last_pith_cited_at":"2026-05-20T17:12:55+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-27T21:07:55.398206+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":13},{"context_role":"method","n":4},{"context_role":"baseline","n":1},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"use_method","n":4},{"context_polarity":"baseline","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-16T08:28:48.955079+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language 
Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":10},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":8},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":8},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":8},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":7},{"title":"Microsoft COCO Captions: Data Collection and Evaluation Server","work_id":"b3d6fb46-4169-4a28-8f7e-2ca6774211da","shared_citers":7},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":6},{"title":"Demystifying clip data","work_id":"7af932a5-c6d2-4f55-9522-7777cc3fa5ae","shared_citers":6},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":6},{"title":"GPT-4 Technical 
Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","work_id":"87bfa84a-e663-4165-806f-93ef439d88d0","shared_citers":6},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":6},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":5},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":5},{"title":"ShareGPT4V: Improving Large Multi-Modal Models with Better Captions","work_id":"90e2b26a-3d27-4567-86b5-929b582a8034","shared_citers":5},{"title":"EVA-CLIP-18B: Scaling clip to 18 billion parameters","work_id":"46cb9433-2fa4-4f5c-93d8-106868c12440","shared_citers":4},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":4},{"title":"Florence: A new foundation model for computer vision","work_id":"99823072-36a8-4b10-9ef5-a7f91da74650","shared_citers":4},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":4}],"time_series":[{"n":4,"year":2023},{"n":2,"year":2024},{"n":3,"year":2025},{"n":34,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","primary_cat":"cs.CV","context_text":"224 49 MetaCLIP [66] 67.7 59.6 - 52.8 - 46.6 - 72.9 - - - 256 64 OpenCLIP [27] 72.8 64.8 - 59.6 - 39.9 57.9 64.9 84.8 - - SigLIP 2 74.0 66.9 
81.4 66.1 66.6 47.2 63.7 75.5 89.3 38.3 49.0 B/16 224 196 CLIP [50] 68.3 61.9 - 55.3 - 33.1 52.4 62.1 81.9 - - OpenCLIP [27] 70.2 62.3 - 56.0 - 42.3 59.4 69.8 86.3 - - MetaCLIP [66] 72.4 65.1 - 60.0 - 48.9 - 77.1 - - - EVA-CLIP [57] 74.7 67.0 - 62.3 - 42.2 58.7 71.2 85.7 - - SigLIP [71] 76.2 69.5 82.8 70.7 69.9 47.2 64.5 77.9 89.6 22.4 29.3 DFN [19] 76.2 68.2 - 63.2 - 51.9 - 77.3 - - - SigLIP 2 78.2 71.4 84.8 73.6 72.1 52.1 68.9 80.7 93.0 40.3 50.7 256 256 SigLIP [71] 76.7 70.1 83.1 71.3 70.3 47.4 65.1 78.3 91.1 22.5 29.9 SigLIP 2 79.1 72.5 85.4 74.5 73.1 53.2 69.7 81.7 94.4 40.","citing_arxiv_id":"2502.14786"},{"n":1,"role":"method","polarity":"use_method","paper_title":"VideoChat: Chat-Centric Video Understanding","primary_cat":"cs.CV","context_text":"Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate the VideoChat-Embed based on BLIP-2 [18] and StableVicuna [10] (Figure 2a). Concretely, we incorporate the pretrained ViT-G [39] with Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo [46] and UniFormerV2 [20]. For the token interface, we employ the pretrained QFormer with extra linear projection, supplemented by additional query tokens to account for video context modeling. 
This allows us to obtain compact LLM-compatible video embeddings for dialogues.","citing_arxiv_id":"2305.06355"}]},"error":null,"updated_at":"2026-05-16T08:28:47.361698+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-16T08:28:47.323221+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","claims":[{"claim_text":"Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion se","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"[116] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025. [117] JD Open Source. Joyai-image: Awakening spatial intelligence in unified multimodal understanding and generation, 2026. URL https://github.com/jd-opensource/JoyAI-Image. 
[118] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate the VideoChat-Embed based on BLIP-2 [18] and StableVicuna [10] (Figure 2a). Concretely, we incorporate the pretrained ViT-G [39] with Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo [46] and UniFormerV2 [20]. For the token ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"ford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 5 [46] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4 [47] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv prep","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. [36] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937-13949, 2021. [37] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 
Eva-clip: Improved training techniques for clip at sc","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"2 Related Work In this section, we first review existing 3D representation learning methods based on vision-language pretraining, and then summarize commonly used 3D scene datasets for pretraining and the evaluation protocols for vision-language models. 3D Vision-Language Pretraining. 3D vision-language pretraining aligns a 3D encoder with pretrained CLIP models [17,43,47,49] and has become a common paradigm for 3D representation learning. Most previous works adopt point clouds as the input mod","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"224 49 MetaCLIP [66] 67.7 59.6 - 52.8 - 46.6 - 72.9 - - - 256 64 OpenCLIP [27] 72.8 64.8 - 59.6 - 39.9 57.9 64.9 84.8 - - SigLIP 2 74.0 66.9 81.4 66.1 66.6 47.2 63.7 75.5 89.3 38.3 49.0 B/16 224 196 CLIP [50] 68.3 61.9 - 55.3 - 33.1 52.4 62.1 81.9 - - OpenCLIP [27] 70.2 62.3 - 56.0 - 42.3 59.4 69.8 86.3 - - MetaCLIP [66] 72.4 65.1 - 60.0 - 48.9 - 77.1 - - - EVA-CLIP [57] 74.7 67.0 - 62.3 - 42.2 58.7 71.2 85.7 - - SigLIP [71] 76.2 69.5 82.8 70.7 69.9 47.2 64.5 77.9 89.6 22.4 29.3 DFN [19] 76.2 68","claim_type":"baseline","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks EVA-CLIP: Improved Training Techniques for CLIP at Scale because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":1,"context_role":"baseline"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-16T08:28:47.365408+00:00"}},"summary":{"title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","claims":[{"claim_text":"Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. 
In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion se","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"[116] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025. [117] JD Open Source. Joyai-image: Awakening spatial intelligence in unified multimodal understanding and generation, 2026. URL https://github.com/jd-opensource/JoyAI-Image. [118] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate the VideoChat-Embed based on BLIP-2 [18] and StableVicuna [10] (Figure 2a). Concretely, we incorporate the pretrained ViT-G [39] with Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo [46] and UniFormerV2 [20]. For the token ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"ford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 5 [46] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 
Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4 [47] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv prep","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. [36] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937-13949, 2021. [37] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at sc","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"2 Related Work In this section, we first review existing 3D representation learning methods based on vision-language pretraining, and then summarize commonly used 3D scene datasets for pretraining and the evaluation protocols for vision-language models. 3D Vision-Language Pretraining. 3D vision-language pretraining aligns a 3D encoder with pretrained CLIP models [17,43,47,49] and has become a common paradigm for 3D representation learning. 
Most previous works adopt point clouds as the input mod","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"224 49 MetaCLIP [66] 67.7 59.6 - 52.8 - 46.6 - 72.9 - - - 256 64 OpenCLIP [27] 72.8 64.8 - 59.6 - 39.9 57.9 64.9 84.8 - - SigLIP 2 74.0 66.9 81.4 66.1 66.6 47.2 63.7 75.5 89.3 38.3 49.0 B/16 224 196 CLIP [50] 68.3 61.9 - 55.3 - 33.1 52.4 62.1 81.9 - - OpenCLIP [27] 70.2 62.3 - 56.0 - 42.3 59.4 69.8 86.3 - - MetaCLIP [66] 72.4 65.1 - 60.0 - 48.9 - 77.1 - - - EVA-CLIP [57] 74.7 67.0 - 62.3 - 42.2 58.7 71.2 85.7 - - SigLIP [71] 76.2 69.5 82.8 70.7 69.9 47.2 64.5 77.9 89.6 22.4 29.3 DFN [19] 76.2 68","claim_type":"baseline","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks EVA-CLIP: Improved Training Techniques for CLIP at Scale because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":1,"context_role":"baseline"},{"n":1,"context_role":"method"}]},"graph":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":10},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":8},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and 
Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":8},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":8},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":7},{"title":"Microsoft COCO Captions: Data Collection and Evaluation Server","work_id":"b3d6fb46-4169-4a28-8f7e-2ca6774211da","shared_citers":7},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":6},{"title":"Demystifying clip data","work_id":"7af932a5-c6d2-4f55-9522-7777cc3fa5ae","shared_citers":6},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","work_id":"87bfa84a-e663-4165-806f-93ef439d88d0","shared_citers":6},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":6},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":5},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":5},{"title":"ShareGPT4V: Improving Large Multi-Modal Models with 
Better Captions","work_id":"90e2b26a-3d27-4567-86b5-929b582a8034","shared_citers":5},{"title":"EVA-CLIP-18B: Scaling clip to 18 billion parameters","work_id":"46cb9433-2fa4-4f5c-93d8-106868c12440","shared_citers":4},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":4},{"title":"Florence: A new foundation model for computer vision","work_id":"99823072-36a8-4b10-9ef5-a7f91da74650","shared_citers":4},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":4}],"time_series":[{"n":4,"year":2023},{"n":2,"year":2024},{"n":3,"year":2025},{"n":34,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","primary_cat":"cs.CV","context_text":"224 49 MetaCLIP [66] 67.7 59.6 - 52.8 - 46.6 - 72.9 - - - 256 64 OpenCLIP [27] 72.8 64.8 - 59.6 - 39.9 57.9 64.9 84.8 - - SigLIP 2 74.0 66.9 81.4 66.1 66.6 47.2 63.7 75.5 89.3 38.3 49.0 B/16 224 196 CLIP [50] 68.3 61.9 - 55.3 - 33.1 52.4 62.1 81.9 - - OpenCLIP [27] 70.2 62.3 - 56.0 - 42.3 59.4 69.8 86.3 - - MetaCLIP [66] 72.4 65.1 - 60.0 - 48.9 - 77.1 - - - EVA-CLIP [57] 74.7 67.0 - 62.3 - 42.2 58.7 71.2 85.7 - - SigLIP [71] 76.2 69.5 82.8 70.7 69.9 47.2 64.5 77.9 89.6 22.4 29.3 DFN [19] 76.2 68.2 - 63.2 - 51.9 - 77.3 - - - SigLIP 2 78.2 71.4 84.8 73.6 72.1 52.1 68.9 80.7 93.0 40.3 50.7 256 256 SigLIP [71] 76.7 70.1 83.1 71.3 70.3 47.4 65.1 78.3 91.1 22.5 29.9 SigLIP 2 79.1 72.5 85.4 74.5 73.1 53.2 69.7 81.7 94.4 40.","citing_arxiv_id":"2502.14786"},{"n":1,"role":"method","polarity":"use_method","paper_title":"VideoChat: Chat-Centric Video Understanding","primary_cat":"cs.CV","context_text":"Figure 2: Architecture and training paradigm of VideoChat-Embed. 
It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate the VideoChat-Embed based on BLIP-2 [18] and StableVicuna [10] (Figure 2a). Concretely, we incorporate the pretrained ViT-G [39] with Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo [46] and UniFormerV2 [20]. For the token interface, we employ the pretrained QFormer with extra linear projection, supplemented by additional query tokens to account for video context modeling. This allows us to obtain compact LLM-compatible video embeddings for dialogues.","citing_arxiv_id":"2305.06355"}]},"authors":[]}}