hub

Building and Better Understand- ing Vision-Language Models: Insights and Future Directions

· 2024 · arXiv 2408.12637

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 1 method 1

citation-polarity summary

background 2 use dataset 1 use method 1

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

PBS-VL trained on the new PBSInstr dataset outperforms general and pathology MLLMs on the PBSBench VQA tasks for hematopathology.

MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

cs.CV · 2025-09-26 · unverdicted · novelty 7.0

MultiMat shows multimodal large models plus constrained search produce higher-quality procedural material graphs than text-only baselines on a new production dataset.

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

cs.CL · 2025-08-11 · unverdicted · novelty 7.0

InterChart is a new benchmark that reveals steep drops in VLM accuracy when moving from single-chart facts to integrative reasoning over 2-3 related charts, with better performance after decomposing complex charts.

Zamba2-VL Technical Report

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Zamba2-VL is a family of 1.2B–7B hybrid Mamba2-transformer vision-language models that match leading transformer VLMs on image, reasoning, OCR, grounding and counting benchmarks while delivering roughly 10x lower time-to-first-token.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

cs.LG · 2026-05-12 · conditional · novelty 6.0 · 2 refs

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithmetic, and proposes a Table Router Pipeline combining VLMs with rule-based execution.

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

cs.CV · 2026-02-07 · unverdicted · novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.

SmolVLM: Redefining small and efficient multimodal models

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.

NVILA: Efficient Frontier Visual Language Models

cs.CV · 2024-12-05 · unverdicted · novelty 5.0

NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

cs.CV · 2026-05-25 · unverdicted · novelty 4.0

VEN-VL introduces an enrich-then-compact visual ensemble MoE approach claiming superior performance-efficiency trade-off in multimodal tasks using fewer condensed visual tokens.

ZAYA1-VL-8B Technical Report

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

citing papers explorer

Showing 16 of 16 citing papers.

DataComp-VLM: Improved Open Datasets for Vision-Language Models cs.CV · 2026-06-26 · conditional · none · ref 143 · 2 links
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 54
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark cs.CL · 2024-09-04 · accept · none · ref 19
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation cs.CV · 2026-04-19 · unverdicted · none · ref 23
PBS-VL trained on the new PBSInstr dataset outperforms general and pathology MLLMs on the PBSBench VQA tasks for hematopathology.
MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models cs.CV · 2025-09-26 · unverdicted · none · ref 11
MultiMat shows multimodal large models plus constrained search produce higher-quality procedural material graphs than text-only baselines on a new production dataset.
InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information cs.CL · 2025-08-11 · unverdicted · none · ref 9
InterChart is a new benchmark that reveals steep drops in VLM accuracy when moving from single-chart facts to integrative reasoning over 2-3 related charts, with better performance after decomposing complex charts.
Zamba2-VL Technical Report cs.CV · 2026-05-29 · unverdicted · none · ref 114
Zamba2-VL is a family of 1.2B–7B hybrid Mamba2-transformer vision-language models that match leading transformer VLMs on image, reasoning, OCR, grounding and counting benchmarks while delivering roughly 10x lower time-to-first-token.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone cs.LG · 2026-05-12 · conditional · none · ref 21 · 2 links
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates cs.CV · 2026-04-17 · unverdicted · none · ref 16
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithmetic, and proposes a Table Router Pipeline combining VLMs with rule-based execution.
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning cs.CV · 2026-02-07 · unverdicted · none · ref 14
Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.
SmolVLM: Redefining small and efficient multimodal models cs.AI · 2025-04-07 · unverdicted · none · ref 21
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 121
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models cs.AI · 2026-04-13 · unverdicted · none · ref 8
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
NVILA: Efficient Frontier Visual Language Models cs.CV · 2024-12-05 · unverdicted · none · ref 96
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding cs.CV · 2026-05-25 · unverdicted · none · ref 4
VEN-VL introduces an enrich-then-compact visual ensemble MoE approach claiming superior performance-efficiency trade-off in multimodal tasks using fewer condensed visual tokens.
ZAYA1-VL-8B Technical Report cs.CV · 2026-05-08 · unverdicted · none · ref 132
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

Building and Better Understand- ing Vision-Language Models: Insights and Future Directions

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer