Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al.

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

cs.CV · 2026-04-10 · accept · novelty 6.0

ACCIDENT is a new benchmark with 2,027 real and 2,211 synthetic annotated video clips for temporal localization, spatial localization, and collision type classification of vehicle accidents in CCTV footage.

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18

citing papers explorer

Showing 3 of 3 citing papers.

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV · 2026-05-15 · unverdicted · none · ref 18
ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos
cs.CV · 2026-04-10 · accept · none · ref 13
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
cs.CV · 2025-11-18 · unreviewed · ref 16
