
arxiv: 2511.20274 · v2 · pith:FCMCLP6Vnew · submitted 2025-11-25 · 💻 cs.CV

SCLARO: A Dataset for Grounded Scenario-Level Scene Understanding and ScenarioCLIP for Benchmarking

classification 💻 cs.CV
keywords dataset · scene · scenarioclip · object · objects · relations · sclaro · understanding

Precise real-world scene understanding in computer vision requires reasoning jointly about the objects present in a scene, the relations between those objects, and the action being performed. However, prior works have not addressed all three jointly, and no large-scale dataset provides grounded annotations at all three levels across diverse visual scenarios. Hence, this work introduces the SCLARO (Scene-Contextual Localisation of Actions, Relations & Objects) dataset, consisting of 615,805 images spanning indoor, outdoor, and driving scenarios, annotated with global action captions, object bounding boxes, and relation triplets that supply structured scene context beyond a free-text caption. To benchmark the dataset, we propose ScenarioCLIP, a tri-level reference model that jointly encodes global scene context, objects, and inter-object relations using disentangled encoders and EMA-based knowledge distillation. We benchmark across a comprehensive suite of tasks on the SCLARO dataset, namely zero-shot retrieval, linear probe, object detection, predicate classification, scene-graph classification, and out-of-domain generalisation. ScenarioCLIP's disentangled encoders improve over prior work such as PyramidCLIP's shared encoder, most notably at the object and relation levels and on out-of-domain generalisation. Code for the data generation pipeline and ScenarioCLIP is available at https://github.com/scenario-clip/SCLARO-ScenarioCLIP
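
The abstract mentions EMA-based knowledge distillation without giving details. As a rough illustration of the general idea only, the sketch below shows a generic exponential-moving-average (EMA) teacher update, in which the teacher's parameters track a smoothed copy of the student's. The function name `ema_update`, the dict-of-lists parameter layout, and the momentum value are assumptions made for this example; they are not taken from the paper or its repository, and ScenarioCLIP's actual distillation scheme may differ.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, momentum: float = 0.99) -> Params:
    """Return new teacher parameters as an EMA of the student parameters.

    Generic rule (not confirmed as the paper's exact formulation):
        teacher <- momentum * teacher + (1 - momentum) * student
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    updated: Params = {}
    for name, t_vals in teacher.items():
        s_vals = student[name]
        if len(t_vals) != len(s_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise blend of the old teacher weight and the student weight.
        updated[name] = [
            momentum * t + (1.0 - momentum) * s for t, s in zip(t_vals, s_vals)
        ]
    return updated


# Toy usage: the teacher moves halfway towards the student when momentum=0.5.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
teacher = ema_update(teacher, student, momentum=0.5)
```

In a typical distillation setup the teacher is updated this way after each student optimisation step and receives no gradients itself; a momentum close to 1 makes the teacher change slowly.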
