GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
R1-v: Reinforcing super generaliza- tion ability in vision-language models with less than $3.https://github.com/Deep-Agent/ R1-V, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2025 2representative citing papers
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
citing papers explorer
-
GRIT: Teaching MLLMs to Think with Images
GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
-
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.