Visual Semantic Role Labeling

Saurabh Gupta, Jitendra Malik

classification cs.CV

keywords action, semantic, doing, objects, actions, associate, different, image

In this paper, we introduce the problem of Visual Semantic Role Labeling: given an image, we want to detect people doing actions and localize the objects of interaction. Classical approaches to action recognition either study the task of action classification at the image or video clip level or, at best, produce a bounding box around the person doing the action. We believe such an output is inadequate and that a complete understanding can only come when we are able to associate objects in the scene with the different semantic roles of the action. To enable progress towards this goal, we annotate a dataset of 16K people instances in 10K images with the actions they are doing and associate objects in the scene with different semantic roles for each action. Finally, we provide a set of baseline algorithms for this task and analyze error modes, providing directions for future work.
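
To make the task output described in the abstract concrete, the following is a minimal Python sketch of what a single prediction might contain: a box around the person, an action label, and a mapping from semantic role names to object boxes. The class name, field names, role names (`instrument`, `object`), box convention, and example values are all illustrative assumptions, not the dataset's actual schema or annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Box as (x_min, y_min, x_max, y_max) in pixels; this convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class RoleLabeledAction:
    """One person doing one action, with objects bound to semantic roles.

    Hypothetical schema for illustration only.
    """
    person_box: Box
    action: str
    # Role name -> object box, or None when no object for that role is visible.
    roles: Dict[str, Optional[Box]] = field(default_factory=dict)

    def localized_roles(self) -> List[str]:
        """Return the names of roles that have an object box, sorted by name."""
        return sorted(name for name, box in self.roles.items() if box is not None)


# Made-up example: a person cutting something with an instrument,
# where the object being cut is not visible in the image.
example = RoleLabeledAction(
    person_box=(50.0, 40.0, 200.0, 380.0),
    action="cut",
    roles={
        "instrument": (180.0, 200.0, 230.0, 240.0),
        "object": None,
    },
)
```

Calling `example.localized_roles()` lists only the roles that were actually tied to an object in the image, which is the part of the output that goes beyond a person box and an action label.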

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
cs.CV 2026-05 unverdicted novelty 7.0

ScriptHOI decomposes HOI phrases into state slots and uses script coverage, conflict, interval partial-label learning, and counterfactual contrast to improve rare and unseen interaction detection while cutting afforda...

Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
cs.CV 2026-04 unverdicted novelty 7.0

FlowSG recasts scene graph generation as progressive flow matching on a hybrid discrete-continuous state using VQ-VAE tokens and a graph Transformer, delivering roughly 3-point gains over prior one-shot methods on VG and PSG.

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
cs.CV 2026-05 unverdicted novelty 6.0

ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain trai...

A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
cs.CV 2026-04 unverdicted novelty 4.0

A diagnostic study shows that two-stage HOI models fail differently across scene configurations like multi-person and rare interactions, revealing that aggregate benchmark accuracy does not imply robust visual reasoning.