pith. sign in

arxiv: 2605.14923 · v1 · pith:D3PUDUKInew · submitted 2026-05-14 · 💻 cs.CV

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pith reviewed 2026-06-30 20:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords hierarchical scene parsingaffordance predictioncross-level bindingvision-language modelsscene understandingSceneParser-Benchstructural pseudo labelscurriculum learning
0
0 comments X

The pith

SceneParser generates explicit scene-object-part-affordance hierarchies with cross-level bindings in one unified pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Hierarchical Scene Parsing as the task of representing a physical scene as an explicit chain from scene to objects to parts to affordances, complete with bindings across levels. It presents SceneParser, a vision-language model trained to produce these hierarchies directly using structural-completion pseudo labels and curriculum learning. This matters because current open-vocabulary grounding and affordance methods produce isolated predictions that ignore the structured dependencies required for interaction. The authors supply a large benchmark built by a hierarchical data engine to train and measure this capability. Experiments indicate that existing multimodal models and stitched pipelines fall short on the new structure-aware metrics while SceneParser performs better and remains compatible with standard detection and affordance tasks.

Core claim

SceneParser is a VLM-based parser trained for unified hierarchical generation of scene-to-object-to-part-to-affordance structures. It relies on structural-completion pseudo labels produced by a scalable data engine together with curriculum learning, and it is evaluated with Level-1 to Level-3 conditional metrics plus ParseRate that separately score localization, cross-level binding, and hierarchical completeness. On the resulting SceneParser-Bench the model outperforms existing MLLMs and perception-stitching pipelines while also transferring to COCO and AGD20K and supporting downstream planning.

What carries the argument

unified hierarchical generation driven by structural-completion pseudo labels and curriculum learning

If this is right

  • SceneParser produces stronger localization, binding, and completeness scores than existing MLLMs on the SceneParser-Bench.
  • The same model remains compatible with conventional object detection and affordance tasks on COCO and AGD20K.
  • The generated hierarchies supply an actionable representation that improves performance in a downstream visual planning probe.
  • Level-1 to Level-3 conditional metrics together with ParseRate isolate the separate contributions of localization and cross-level structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems could use the explicit chains to plan multi-step interactions without separately querying each perception module.
  • The hierarchical data engine could be reused to annotate other scene-understanding datasets that currently lack part-affordance links.
  • Training on full hierarchies may reduce error propagation in longer visual-reasoning chains compared with post-hoc stitching of separate models.

Load-bearing premise

The structural-completion pseudo labels produced by the hierarchical data engine correctly capture real-world cross-level bindings and affordances without systematic labeling errors.

What would settle it

If retraining SceneParser on the same images but with randomly shuffled or human-corrected object-part-affordance chains yields identical scores on the Level-1 to Level-3 metrics and ParseRate, the contribution of the pseudo-label engine would be falsified.

Figures

Figures reproduced from arXiv: 2605.14923 by Hao Fei, Luping Xiao, Meishan Zhang, Pengxin Xu, Qing Jiang, Shanghang Zhang, Xincheng Lin, Xingyu Chen.

Figure 1
Figure 1. Figure 1: SceneParser overview. Conventional perception produces local predictions, such as object boxes, part regions, and affordance points, which are often evaluated independently and leave cross-level relations unverified. In contrast, we parse a scene into an explicit scene → object → part → affordance hierarchy.Besides, our hierarchy-aware evaluation conditions lower-level correctness on upper-level matches, j… view at source ↗
Figure 2
Figure 2. Figure 2: Statistical overview of SceneParser-Bench. Left: object, part, and affordance/action word clouds from top to bottom. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of SceneParser. SceneParser learns serialized object-part-affordance hierarchy targets and generates [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Analysis of hierarchy, affordance context, and cross-benchmark transfer. (a) Nested hierarchy outperforms flat [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the proposed data engine. Starting from an RGB scene image, the engine progressively constructs [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison among ground truth, Rex-Omni Stitching, and SceneParser. Compared with the stitching [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative downstream interaction-oriented reasoning probe. We compare a task-only prompt, which already [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
read the original abstract

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Hierarchical Scene Parsing as a task that represents scenes via explicit scene→object→part→affordance hierarchies with cross-level bindings. It presents SceneParser, a VLM trained with structural-completion pseudo labels and curriculum learning, and constructs SceneParser-Bench (110K training images, 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations) via a scalable hierarchical data engine. Experiments claim that existing MLLMs and perception-stitching pipelines underperform on Level-1/Level-2/Level-3 conditional metrics and ParseRate, while SceneParser shows stronger structure-aware results; additional ablations, COCO/AGD20K evaluations, and a downstream planning probe are reported to show compatibility with conventional tasks.

Significance. If the gains on SceneParser-Bench reflect improved capture of real physical structure rather than replication of the data engine, the work could advance interaction-oriented scene representations for robotics and planning. The scale of the introduced benchmark and the new hierarchical metrics constitute a concrete contribution, and the external evaluations plus planning probe provide some grounding beyond the synthetic data.

major comments (2)
  1. [Abstract; SceneParser-Bench construction] Abstract (benchmark construction paragraph) and § on SceneParser-Bench: both the 110K training images and the 5K validation split are generated by the same 'scalable hierarchical data engine' that supplies the structural-completion pseudo labels used for training. No independent human ground truth is described for cross-level bindings or affordance chains. This makes the reported superiority on Level-1/2/3 metrics and ParseRate potentially circular, as higher scores may indicate better mimicry of the engine's labeling distribution rather than superior modeling of real-world scene structure.
  2. [Abstract; experimental results] Abstract (experiments paragraph): the central claim that 'SceneParser achieves stronger structure-aware performance' rests on comparisons whose quantitative values, baseline implementations, error bars, and ablation tables are not supplied in the abstract and must be verified against the full experimental section to confirm they are not driven by the shared data-generation process.
minor comments (1)
  1. [Abstract] The abstract states performance improvements without any numerical values, making the strength of the result difficult to gauge from the summary alone.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below, indicating where revisions will be made to improve clarity and transparency.

read point-by-point responses
  1. Referee: [Abstract; SceneParser-Bench construction] Abstract (benchmark construction paragraph) and § on SceneParser-Bench: both the 110K training images and the 5K validation split are generated by the same 'scalable hierarchical data engine' that supplies the structural-completion pseudo labels used for training. No independent human ground truth is described for cross-level bindings or affordance chains. This makes the reported superiority on Level-1/2/3 metrics and ParseRate potentially circular, as higher scores may indicate better mimicry of the engine's labeling distribution rather than superior modeling of real-world scene structure.

    Authors: We agree that both the training and validation portions of SceneParser-Bench are produced by the same hierarchical data engine used to generate pseudo labels, and that the manuscript does not describe independent human ground truth for the cross-level bindings or affordance chains. This creates a legitimate risk that performance gains reflect fidelity to the engine's labeling process rather than improved modeling of physical scene structure. We will revise the benchmark section and results discussion to explicitly acknowledge this limitation, describe the design principles of the data engine, and note that future work should include human validation. revision_made = 'yes'. revision: yes

  2. Referee: [Abstract; experimental results] Abstract (experiments paragraph): the central claim that 'SceneParser achieves stronger structure-aware performance' rests on comparisons whose quantitative values, baseline implementations, error bars, and ablation tables are not supplied in the abstract and must be verified against the full experimental section to confirm they are not driven by the shared data-generation process.

    Authors: The abstract provides only a high-level summary of the experimental outcomes. All quantitative results, baseline details, error bars, ablation studies, COCO/AGD20K evaluations, and the planning probe are reported in the full experimental section. The manuscript already contains the supporting tables and analysis needed to substantiate the claims. No changes to the abstract are necessary, though we can add a cross-reference if desired. revision_made = 'no'. revision: no

standing simulated objections not resolved
  • Absence of independent human ground-truth annotations for cross-level bindings and affordance chains in SceneParser-Bench; addressing this would require a separate large-scale human annotation campaign outside the current scope.

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on synthetic benchmark with external dataset support

full rationale

The paper describes an empirical VLM-based model trained with structural-completion pseudo labels from a hierarchical data engine and evaluated on SceneParser-Bench constructed from the same engine, plus ablations and results on external datasets COCO and AGD20K. No equations, fitted parameters, derivations, or self-citation chains are present in the abstract or described claims. Performance is presented as direct comparison rather than any reduction of outputs to inputs by construction. This is a standard empirical ML setup with no load-bearing circular steps matching the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, derivations, or modeling assumptions beyond the high-level description of pseudo-label generation and curriculum learning; ledger entries cannot be populated without full text.

pith-pipeline@v0.9.1-grok · 5808 in / 1161 out tokens · 23826 ms · 2026-06-30T20:50:04.064132+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 26 canonical work pages · 11 internal anchors

  1. [1]

    The pascal visual object classes (voc) challenge.IJCV, 2010

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.IJCV, 2010

  2. [2]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

  3. [3]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR), 2017

  4. [4]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.URL https://arxiv. org/abs/2307.15818, 2024

  5. [5]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2024

  6. [6]

    Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

    Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

  7. [7]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision(ECCV), 2024

  8. [8]

    Grounding dino 1.5: Advance the edge of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

    Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xi- aoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the edge of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

  9. [9]

    Yolo-world: Real-time open- vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open- vocabulary object detection. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recogni- tion(CVPR), 2024

  10. [10]

    Florence-2: Advancing a unified representation for a variety of vision tasks

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

  11. [11]

    Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models

    Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

  12. [12]

    Towards visual grounding: A survey

    Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, and Changsheng Xu. Towards visual grounding: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI), 2025

  13. [13]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2022

  14. [14]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023. 11 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

  15. [15]

    Ferret-v2: An improved baseline for referring and grounding with large language models.arXiv preprint arXiv:2404.07973, 2024

    Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models.arXiv preprint arXiv:2404.07973, 2024

  16. [16]

    Groma: Localized visual tokenization for grounding multimodal large language models

    Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. InEuropean Conference on Computer Vision(ECCV), 2024

  17. [17]

    Chatrex: Taming multimodal llm for joint perception and understanding.arXiv preprint arXiv:2411.18363, 2024

    Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, and Lei Zhang. Chatrex: Taming multimodal llm for joint perception and understanding.arXiv preprint arXiv:2411.18363, 2024

  18. [18]

    Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025

    Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025

  19. [19]

    VGR: Visual Grounded Reasoning

    Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

  20. [20]

    Partimagenet: A large, high-quality dataset of parts

    Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. InEuropean Conference on Computer Vision(ECCV), 2022

  21. [21]

    Paco: Parts and attributes of common objects

    Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2023

  22. [22]

    Learning affordance grounding from exocentric images

    Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2022

  23. [23]

    Affordancellm: Grounding affordance from vision language models

    Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. Affordancellm: Grounding affordance from vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

  24. [24]

    Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

  25. [25]

    Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models

    Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence(AAAI), 2026

  26. [26]

    Digital gene: Learning about the physical world through analytic concepts.arXiv preprint arXiv:2504.04170, 2025

    Jianhua Sun and Cewu Lu. Digital gene: Learning about the physical world through analytic concepts.arXiv preprint arXiv:2504.04170, 2025

  27. [27]

    Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu

    Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

  28. [28]

    Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

    Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. Unleashing spatial reasoning in multimodal large language models via textual representation guided reasoning.arXiv preprint arXiv:2603.23404, 2026

  29. [29]

    Thinking with visual primitives

    Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, and Gang Zeng. Thinking with visual primitives. https://github.com/deepseek-ai/Thinking-with-Visual-Primitives , 2026. Technical report

  30. [30]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR), 2016. 12 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

  31. [31]

    Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems, 2015

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems, 2015

  32. [32]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision(ECCV), 2020

  33. [33]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022

  34. [34]

    Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

    Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, et al. Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

  35. [35]

    Yoloe: Real-time seeing anything

    Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2025

  36. [36]

    T-rex2: Towards generic object detection via text-visual prompt synergy

    Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-rex2: Towards generic object detection via text-visual prompt synergy. InEuropean Conference on Computer Vision(ECCV), 2024

  37. [37]

    Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems(NeuIPS), 2023

    Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems(NeuIPS), 2023

  38. [38]

    Language-conditioned detection transformer

    Jang Hyun Cho and Philipp Krähenbühl. Language-conditioned detection transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

  39. [39]

    arXiv preprint arXiv:2109.10852 , year=

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection.arXiv preprint arXiv:2109.10852, 2021

  40. [40]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023

  41. [41]

    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023

  42. [42]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

  43. [43]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2024

  44. [44]

    Detect what you can: Detecting and representing objects using holistic models and body parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014

  45. [45]

    Going denser with open- vocabulary part segmentation

    Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open- vocabulary part segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2023

  46. [46]

    Understanding multi-granularity for open- vocabulary part segmentation.Advances in Neural Information Processing Systems, 2024

    Jiho Choi, Seonho Lee, Seungho Lee, Minhyun Lee, and Hyunjung Shim. Understanding multi-granularity for open- vocabulary part segmentation.Advances in Neural Information Processing Systems, 2024

  47. [47]

    Partglee: A foundation model for recognizing and parsing any objects

    Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, and Xiang Bai. Partglee: A foundation model for recognizing and parsing any objects. InEuropean Conference on Computer Vision(ECCV), 2024. 13 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

  48. [48]

    Instructpart: Task-oriented part segmentation with instruction reasoning

    Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, and Katia P Sycara. Instructpart: Task-oriented part segmentation with instruction reasoning. InProceedings of the Meeting of the Association for Computational Linguistics, 2025

  49. [49]

    Affordance detection of tool parts from geometric features

    Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InIEEE international conference on robotics and automation(ICRA), 2015

  50. [50]

    Object-based affordances detection with convolutional neural networks and dense conditional random fields

    Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. InIEEE/RSJ International Conference on Intelligent Robots and Systems(IROS), 2017

  51. [51]

    Maskprompt: open-vocabulary affordance segmentation with object shape mask prompts

    Dongpan Chen, Dehui Kong, Jinghua Li, and Baocai Yin. Maskprompt: open-vocabulary affordance segmentation with object shape mask prompts. InProceedings of the AAAI Conference on Artificial Intelligence(AAAI), 2025

  52. [52]

    Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter

    Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS), 2025

  53. [53]

    Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

    Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

  54. [54]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2025

  55. [55]

    A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

    Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, and Ying-Cong Chen. A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

  56. [56]

    Glover++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation.arXiv preprint arXiv:2505.11865, 2025

    Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, and Junwei Liang. Glover++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation.arXiv preprint arXiv:2505.11865, 2025

  57. [57]

    3d-affordancellm: Har- nessing large language models for open-vocabulary affordance detection in 3d worlds.arXiv preprint arXiv:2502.20041, 2025

    Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3d-affordancellm: Har- nessing large language models for open-vocabulary affordance detection in 3d worlds.arXiv preprint arXiv:2502.20041, 2025

  58. [58]

    Learning precise affordances from egocentric videos for robotic manipulation

    Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, and Laura Sevilla- Lara. Learning precise affordances from egocentric videos for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2025

  59. [59]

    Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale.arXiv preprint arXiv:2506.12009, 2025

    Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, and Minsu Cho. Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale.arXiv preprint arXiv:2506.12009, 2025

  60. [60]

    Gpt-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

    OpenAI. Gpt-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

  61. [61]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  62. [62]

    Egoobjects: A large-scale egocentric dataset for fine-grained object understanding

    Chenchen Zhu, Fanyi Xiao, Andrés Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Culatana, Roshan Sumbaly, and Zhicheng Yan. Egoobjects: A large-scale egocentric dataset for fine-grained object understanding. In Proceedings of the IEEE/CVF international conference on computer vision(ICCV), 2023

  63. [63]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the International Conference on Machine Learning, 2009

  64. [64]

    Self-paced learning for latent variable models.Advances in Neural Information Processing Systems(NeuIPS), 2010

    M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models.Advances in Neural Information Processing Systems(NeuIPS), 2010

  65. [65]

    Competence-based curriculum learning for neural machine translation

    Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. InProceedings of the conference of the North American chapter of the association for computational linguistics, 2019

  66. [66]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems(NeuIPS), 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems(NeuIPS), 2023

  67. [67]

    Curriculum learning with quality-driven data selection.arXiv preprint arXiv:2407.00102, 2024

    Biao Wu and Ling Chen. Curriculum learning with quality-driven data selection.arXiv preprint arXiv:2407.00102, 2024

  68. [68]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

  69. [69]

    Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, 2026

  70. [70]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  71. [71]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  72. [72]

    Seed2.0 model card: Towards intelligence frontier for real-world complexity

    ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity. ByteDance Seed technical report, 2026

  73. [73]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 15 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding A Formal Definition of Evaluation Metrics Overview.We provide the formal definition of ...

  74. [74]

    door”: “white door

    Scene - level Object Grounding 1. Scene - level Object Description (GPT5) Scene - level object noun and referring expressions • “door”: “white door” • “ cookto p ” : “black cooktop” • “ socket ” : “white socket” • “ counte r ” : “kitchen counter with drawer” 2 . Object Grounding& Merging 2 . Object - level Part Parsing Object Bounding Box 3. Object Croppi...

  75. [75]

    whisk handl e

    Affordance Parsing 7. Affordance Region D escription (GPT5) • “ whisk handl e ” • “ door handl e ” • “ rim of the bowl ” • “ s ocket button ” 8 . Affordance Segmentation (SAM3) Hierarchical Scene Scene | -- Object: Socket | | -- Bbox | | -- Part: Button | | | -- Bbox | | | -- Affordance | | | -- Action | | | -- Point | | -- Part: Button | | -- Bbox | | --...

  76. [76]

    Parse all visible interactive objects

  77. [77]

    If no valid object, return {"objects":[]}

  78. [78]

    Always include parts and affordances arrays (empty if none)

  79. [79]

    Do not output extra keys or explanations

  80. [80]

    string",

    No placeholder literals like "string", "object", "part", "action". 19 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding F Qualitative Visualization GT obj:glaves obj:stove part:stovetop part:stovetop part:handel obj:basket obj:bag obj:bottle obj:drawer obj:cupboard obj:drawer obj:spatula obj:rag obj:bowl obj:ladle obj:mixer part:h...