GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
Mme: A comprehensive evaluation benchmark for multimodal large language models
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 3years
2025 3roles
dataset 1polarities
use dataset 1representative citing papers
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
citing papers explorer
-
GRIT: Teaching MLLMs to Think with Images
GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.