SCLARO: A Dataset for Grounded Scenario-Level Scene Understanding and ScenarioCLIP for Benchmarking

Aashutosh A V; Abhijit Das; Advik Sinha; Saurabh Atreya; Sk Aziz Ali

Precise real-world scene understanding in computer vision requires joint reasoning about the objects present in a scene, the relations between them, and the action being performed. However, prior work has not addressed all three jointly, and no large-scale dataset provides grounded annotations at all three levels across diverse visual scenarios. This work therefore introduces the SCLARO (Scene-Contextual Localisation of Actions, Relations & Objects) dataset, consisting of 615,805 images spanning indoor, outdoor, and driving scenarios, annotated with global action captions, object bounding boxes, and relation triplets that supply structured scene context beyond a free-text caption. To benchmark the dataset, we propose ScenarioCLIP, a tri-level reference model that jointly encodes global scene context, objects, and inter-object relations using disentangled encoders and exponential moving average (EMA) based knowledge distillation. We benchmark across a suite of tasks on the SCLARO dataset: zero-shot retrieval, linear probing, object detection, predicate classification, scene-graph classification, and out-of-domain generalisation. ScenarioCLIP's disentangled encoders improve over prior work such as PyramidCLIP's shared encoder, most notably at the object and relation levels and on out-of-domain generalisation. Code for the data generation pipeline and ScenarioCLIP is available at https://github.com/scenario-clip/SCLARO-ScenarioCLIP
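
To make the three annotation levels concrete, the following is a minimal sketch of how one annotated image might be represented. It is an illustration only: the class names, field names, box convention, and example values are all assumptions made here, not the schema of the released dataset, which should be taken from the linked repository.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectBox:
    label: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[float, float, float, float]


@dataclass
class RelationTriplet:
    subject: int    # index into SceneAnnotation.objects
    predicate: str  # e.g. "riding"
    object: int     # index into SceneAnnotation.objects


@dataclass
class SceneAnnotation:
    image_id: str
    action_caption: str  # global, scene-level action description
    objects: List[ObjectBox] = field(default_factory=list)
    relations: List[RelationTriplet] = field(default_factory=list)

    def is_grounded(self) -> bool:
        """True if every triplet points at an annotated object box."""
        n = len(self.objects)
        return all(0 <= r.subject < n and 0 <= r.object < n
                   for r in self.relations)

    def triplets_as_text(self) -> List[str]:
        """Render each triplet as a 'subject predicate object' string."""
        return [
            f"{self.objects[r.subject].label} {r.predicate} "
            f"{self.objects[r.object].label}"
            for r in self.relations
        ]


# Hypothetical record, invented for illustration.
example = SceneAnnotation(
    image_id="hypothetical_0001",
    action_caption="a person riding a bicycle on a street",
    objects=[
        ObjectBox("person", (120.0, 40.0, 260.0, 380.0)),
        ObjectBox("bicycle", (100.0, 200.0, 300.0, 420.0)),
    ],
    relations=[RelationTriplet(subject=0, predicate="riding", object=1)],
)
```

Storing triplets as indices into the object list, rather than as free text, is one way to keep each relation tied to specific bounding boxes, which is the sense in which such annotations are grounded.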
