Tool reference

Wanjuan: A comprehensive multimodal dataset for ad- vancing english and chinese large models

Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, Da Lin · 2023 · arXiv 2308.10755

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

7 Pith papers citing it

Method reference 80% of classified citations

read on arXiv browse 7 citing papers

citation-role summary

dataset 3 background 1 method 1

citation-polarity summary

use dataset 3 background 1 use method 1

representative citing papers

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

cs.CV · 2024-12-13 · accept · novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

cs.CV · 2024-01-29 · unverdicted · novelty 5.0

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

cs.CV · 2023-09-26 · conditional · novelty 4.0

InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.

citing papers explorer

Showing 7 of 7 citing papers.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 81
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation cs.CV · 2023-07-13 · unverdicted · none · ref 17
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 30
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 47
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model cs.CV · 2024-01-29 · unverdicted · none · ref 32
InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 31
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition cs.CV · 2023-09-26 · conditional · none · ref 30
InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.

Wanjuan: A comprehensive multimodal dataset for ad- vancing english and chinese large models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer