Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
citation roles: baseline (1)
citation polarities: baseline (1)
representative citing papers: CogVLM2, VLM-AutoDrive
citing papers explorer
- Are We on the Right Way for Evaluating Large Vision-Language Models?
  Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
- CogVLM2: Visual Language Models for Image and Video Understanding
  CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
- VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events
  VLM-AutoDrive adapts pretrained VLMs via metadata captions, LLM descriptions, VQA, and CoT supervision, lifting collision F1 from 0.00 to 0.69 and accuracy from 35.35% to 77.27% on Nexar dashcam videos.