GRiT: A Generative Region-to-text Transformer for Object Understanding

Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang · 2022 · arXiv 2212.00280

4 Pith papers cite this work. Polarity classification is still in progress.

citation-role summary

method 2 · background 1

citation-polarity summary

use method 2 · background 1

representative citing papers

WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

cs.CL · 2023-07-30 · unverdicted · novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

cs.CV · 2023-09-29 · conditional · novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

citing papers explorer

WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
cs.CV · 2026-05-05 · unverdicted · none · ref 32 · 2 links
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL · 2023-07-30 · unverdicted · none · ref 28
VideoChat: Chat-Centric Video Understanding
cs.CV · 2023-05-10 · conditional · none · ref 48
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
cs.CV · 2023-09-29 · conditional · none · ref 138
