Tool reference

Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models

Wanjuan: A comprehensive multimodal dataset for advancing english, chinese large models , author= · 2023 · arXiv 2308.10755

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

9 Pith papers citing it

Method reference 80% of classified citations

read on arXiv browse 9 citing papers

citation-role summary

dataset 3 background 1 method 1

citation-polarity summary

use dataset 3 background 1 use method 1

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

cs.CV · 2024-12-13 · accept · novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

cs.CV · 2024-01-29 · unverdicted · novelty 5.0

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

cs.CL · 2026-06-13 · unverdicted · novelty 4.0

Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

cs.CV · 2023-09-26 · conditional · novelty 4.0

InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer