Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
representative citing papers
ACCIDENT is a new benchmark with 2,027 real and 2,211 synthetic annotated video clips for temporal localization, spatial localization, and collision type classification of vehicle accidents in CCTV footage.
citing papers explorer
- From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
- ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos
ACCIDENT is a new benchmark with 2,027 real and 2,211 synthetic annotated video clips for temporal localization, spatial localization, and collision type classification of vehicle accidents in CCTV footage.
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs