From Recognition to Cognition: Visual Commonsense Reasoning

2018 · cs.CV · arXiv 1811.10830

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

representative citing papers

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

cs.CL · 2026-06-29 · unverdicted · novelty 5.0

A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

cs.CL · 2026-06-14 · unverdicted · novelty 4.0

Machine interpreting should shift from fidelity metrics to three design priorities—agency, grounding, and experience—drawn from interpreting studies to close the usability gap with human-mediated communication.

citing papers explorer

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

cs.CL · 2026-06-29 · unverdicted · none · ref 53 · internal anchor

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

cs.CL · 2026-06-14 · unverdicted · none · ref 110 · internal anchor
