Vintern-1B: An efficient multimodal large language model for Vietnamese. arXiv preprint arXiv:2408.12480
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV 3
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
U-CESE integrates three CESE modules into a unified clip-based pipeline with DAKE keyframe extraction and ReCap captioning to support consistent multimodal event retrieval across video sources.
Proposes a multimodal pipeline for video retrieval that incorporates information from multiple frames to enable higher-level abstraction beyond single-image object detection.
citing papers explorer
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025
  U-CESE integrates three CESE modules into a unified clip-based pipeline with DAKE keyframe extraction and ReCap captioning to support consistent multimodal event retrieval across video sources.
- Multimodal Contextualized Support for Enhancing Video Retrieval System
  Proposes a multimodal pipeline for video retrieval that incorporates information from multiple frames to enable higher-level abstraction beyond single-image object detection.