CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson , Bharath Hariharan , Laurens van der Maaten , Li Fei-Fei , C. Lawrence Zitnick , Ross Girshick

Authors on Pith no claims yet

classification 💻 cs.CV cs.CLcs.LG

keywords reasoningvisualdatasetdiagnosticabilitiesanalyzeanswerbiases

read the original abstract

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains
cs.RO 2026-04 unverdicted novelty 6.0

RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.