pith. machine review for the scientific record

arxiv: 2605.14923 · v1 · submitted 2026-05-14 · 💻 cs.CV


SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

keywords: hierarchical scene parsing, SceneParser, affordance understanding, cross-level, interaction-oriented
Abstract

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene → object → part → affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo-labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. In addition, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
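The scene → object → part → affordance hierarchy with cross-level bindings described in the abstract can be pictured as a nested tree in which each "chain instance" is one root-to-leaf path. The following is a minimal sketch under that reading; all class names, field names, and the chain-enumeration helper are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of a scene -> object -> part -> affordance
# hierarchy. Every name here is an assumption made for illustration;
# the paper's real representation may differ.

@dataclass
class Affordance:
    label: str                 # e.g. "graspable"
    point: Tuple[int, int]     # (x, y) interaction point

@dataclass
class Part:
    label: str                 # e.g. "handle"
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2)
    affordances: List[Affordance] = field(default_factory=list)

@dataclass
class SceneObject:
    label: str                 # e.g. "mug"
    bbox: Tuple[int, int, int, int]
    parts: List[Part] = field(default_factory=list)

@dataclass
class Scene:
    objects: List[SceneObject] = field(default_factory=list)

    def chains(self):
        """Enumerate object-part-affordance chain instances,
        i.e. one tuple per complete root-to-leaf binding."""
        for obj in self.objects:
            for part in obj.parts:
                for aff in part.affordances:
                    yield (obj.label, part.label, aff.label)

# Minimal example: one object, one part, one affordance -> one chain.
scene = Scene(objects=[
    SceneObject("mug", (10, 10, 80, 90), parts=[
        Part("handle", (60, 30, 80, 70), affordances=[
            Affordance("graspable", (70, 50)),
        ]),
    ]),
])
print(list(scene.chains()))  # [('mug', 'handle', 'graspable')]
```

Under this reading, the 1.74M "valid object-part-affordance chain instances" would correspond to complete paths of this kind, which is presumably what a completeness metric like ParseRate counts against.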

