DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
hub Baseline reference
Ku, Qian Liu, and Wenhu Chen
Baseline reference. 70% of citing Pith papers use this work as a benchmark or comparison.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
All 18 audited MLLMs exhibit order sensitivity with per-facet flip rates of 24-50%, exceeding same-order decoder noise.
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
Pareto LoRA applies Pareto-optimal gradient integration to balance text and image objectives in LoRA-based fine-tuning of unified multimodal models, reporting up to 44.9% gains in image quality on the CoMM benchmark with Emu2 while preserving text performance.
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
CrossView Suite supplies a 1.6M-sample dataset, scene-disjoint benchmark, and explicit-alignment framework to advance MLLMs from single-view perception to cross-view spatial intelligence.
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.
citing papers explorer
No citing papers match the current filters.