SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Hao Fei; Luping Xiao; Meishan Zhang; Pengxin Xu; Qing Jiang; Shanghang Zhang; Xincheng Lin; Xingyu Chen

arxiv: 2605.14923 · v1 · pith:D3PUDUKInew · submitted 2026-05-14 · 💻 cs.CV

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pengxin Xu , Xincheng Lin , Luping Xiao , Qing Jiang , Meishan Zhang , Hao Fei , Shanghang Zhang , Xingyu Chen This is my paper

Pith reviewed 2026-06-30 20:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords hierarchical scene parsingaffordance predictioncross-level bindingvision-language modelsscene understandingSceneParser-Benchstructural pseudo labelscurriculum learning

0 comments

The pith

SceneParser generates explicit scene-object-part-affordance hierarchies with cross-level bindings in one unified pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Hierarchical Scene Parsing as the task of representing a physical scene as an explicit chain from scene to objects to parts to affordances, complete with bindings across levels. It presents SceneParser, a vision-language model trained to produce these hierarchies directly using structural-completion pseudo labels and curriculum learning. This matters because current open-vocabulary grounding and affordance methods produce isolated predictions that ignore the structured dependencies required for interaction. The authors supply a large benchmark built by a hierarchical data engine to train and measure this capability. Experiments indicate that existing multimodal models and stitched pipelines fall short on the new structure-aware metrics while SceneParser performs better and remains compatible with standard detection and affordance tasks.

Core claim

SceneParser is a VLM-based parser trained for unified hierarchical generation of scene-to-object-to-part-to-affordance structures. It relies on structural-completion pseudo labels produced by a scalable data engine together with curriculum learning, and it is evaluated with Level-1 to Level-3 conditional metrics plus ParseRate that separately score localization, cross-level binding, and hierarchical completeness. On the resulting SceneParser-Bench the model outperforms existing MLLMs and perception-stitching pipelines while also transferring to COCO and AGD20K and supporting downstream planning.

What carries the argument

unified hierarchical generation driven by structural-completion pseudo labels and curriculum learning

If this is right

SceneParser produces stronger localization, binding, and completeness scores than existing MLLMs on the SceneParser-Bench.
The same model remains compatible with conventional object detection and affordance tasks on COCO and AGD20K.
The generated hierarchies supply an actionable representation that improves performance in a downstream visual planning probe.
Level-1 to Level-3 conditional metrics together with ParseRate isolate the separate contributions of localization and cross-level structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robotic systems could use the explicit chains to plan multi-step interactions without separately querying each perception module.
The hierarchical data engine could be reused to annotate other scene-understanding datasets that currently lack part-affordance links.
Training on full hierarchies may reduce error propagation in longer visual-reasoning chains compared with post-hoc stitching of separate models.

Load-bearing premise

The structural-completion pseudo labels produced by the hierarchical data engine correctly capture real-world cross-level bindings and affordances without systematic labeling errors.

What would settle it

If retraining SceneParser on the same images but with randomly shuffled or human-corrected object-part-affordance chains yields identical scores on the Level-1 to Level-3 metrics and ParseRate, the contribution of the pseudo-label engine would be falsified.

Figures

Figures reproduced from arXiv: 2605.14923 by Hao Fei, Luping Xiao, Meishan Zhang, Pengxin Xu, Qing Jiang, Shanghang Zhang, Xincheng Lin, Xingyu Chen.

**Figure 1.** Figure 1: SceneParser overview. Conventional perception produces local predictions, such as object boxes, part regions, and affordance points, which are often evaluated independently and leave cross-level relations unverified. In contrast, we parse a scene into an explicit scene → object → part → affordance hierarchy.Besides, our hierarchy-aware evaluation conditions lower-level correctness on upper-level matches, j… view at source ↗

**Figure 2.** Figure 2: Statistical overview of SceneParser-Bench. Left: object, part, and affordance/action word clouds from top to bottom. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of SceneParser. SceneParser learns serialized object-part-affordance hierarchy targets and generates [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Analysis of hierarchy, affordance context, and cross-benchmark transfer. (a) Nested hierarchy outperforms flat [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Overview of the proposed data engine. Starting from an RGB scene image, the engine progressively constructs [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison among ground truth, Rex-Omni Stitching, and SceneParser. Compared with the stitching [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative downstream interaction-oriented reasoning probe. We compare a task-only prompt, which already [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

read the original abstract

General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The hierarchical task and benchmark are new, but the self-generated labels create a real risk that reported gains just track the data engine rather than real scene structure.

read the letter

The paper defines hierarchical scene parsing as explicit object-part-affordance chains with cross-level bindings and builds SceneParser-Bench around it. That task framing and the ParseRate metric plus the Level-1/2/3 conditional scores look like the main additions relative to prior isolated grounding or affordance papers. The 110K-image scale and the downstream planning probe are also concrete steps forward.

What the work does reasonably is pull together several strands that usually sit in separate papers and give them a single training and evaluation setup. The curriculum learning and structural-completion pseudo labels are sensible engineering choices for a VLM-based parser.

The soft spot is exactly the one in the stress-test note. Both training and the 5K validation split come from the same scalable hierarchical data engine. No independent human ground truth is mentioned for the affordance chains or cross-level bindings. If the engine has systematic biases on rare configurations or ambiguous boundaries, then better ParseRate on the benchmark mainly shows the model reproduced the engine's distribution, not that it captures physical structure better. The abstract gives no numbers, error bars, or baseline tables, so the central claim stays uncheckable from the provided text.

This is aimed at robotics and embodied-AI groups that already stitch object, part, and affordance predictors and want a more unified representation. It deserves peer review because the task motivation is clear and the benchmark size is substantial, but any referee will have to press hard on label quality and external validation before the performance numbers can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The paper introduces Hierarchical Scene Parsing as a task that represents scenes via explicit scene→object→part→affordance hierarchies with cross-level bindings. It presents SceneParser, a VLM trained with structural-completion pseudo labels and curriculum learning, and constructs SceneParser-Bench (110K training images, 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations) via a scalable hierarchical data engine. Experiments claim that existing MLLMs and perception-stitching pipelines underperform on Level-1/Level-2/Level-3 conditional metrics and ParseRate, while SceneParser shows stronger structure-aware results; additional ablations, COCO/AGD20K evaluations, and a downstream planning probe are reported to show compatibility with conventional tasks.

Significance. If the gains on SceneParser-Bench reflect improved capture of real physical structure rather than replication of the data engine, the work could advance interaction-oriented scene representations for robotics and planning. The scale of the introduced benchmark and the new hierarchical metrics constitute a concrete contribution, and the external evaluations plus planning probe provide some grounding beyond the synthetic data.

major comments (2)

[Abstract; SceneParser-Bench construction] Abstract (benchmark construction paragraph) and § on SceneParser-Bench: both the 110K training images and the 5K validation split are generated by the same 'scalable hierarchical data engine' that supplies the structural-completion pseudo labels used for training. No independent human ground truth is described for cross-level bindings or affordance chains. This makes the reported superiority on Level-1/2/3 metrics and ParseRate potentially circular, as higher scores may indicate better mimicry of the engine's labeling distribution rather than superior modeling of real-world scene structure.
[Abstract; experimental results] Abstract (experiments paragraph): the central claim that 'SceneParser achieves stronger structure-aware performance' rests on comparisons whose quantitative values, baseline implementations, error bars, and ablation tables are not supplied in the abstract and must be verified against the full experimental section to confirm they are not driven by the shared data-generation process.

minor comments (1)

[Abstract] The abstract states performance improvements without any numerical values, making the strength of the result difficult to gauge from the summary alone.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below, indicating where revisions will be made to improve clarity and transparency.

read point-by-point responses

Referee: [Abstract; SceneParser-Bench construction] Abstract (benchmark construction paragraph) and § on SceneParser-Bench: both the 110K training images and the 5K validation split are generated by the same 'scalable hierarchical data engine' that supplies the structural-completion pseudo labels used for training. No independent human ground truth is described for cross-level bindings or affordance chains. This makes the reported superiority on Level-1/2/3 metrics and ParseRate potentially circular, as higher scores may indicate better mimicry of the engine's labeling distribution rather than superior modeling of real-world scene structure.

Authors: We agree that both the training and validation portions of SceneParser-Bench are produced by the same hierarchical data engine used to generate pseudo labels, and that the manuscript does not describe independent human ground truth for the cross-level bindings or affordance chains. This creates a legitimate risk that performance gains reflect fidelity to the engine's labeling process rather than improved modeling of physical scene structure. We will revise the benchmark section and results discussion to explicitly acknowledge this limitation, describe the design principles of the data engine, and note that future work should include human validation. revision_made = 'yes'. revision: yes
Referee: [Abstract; experimental results] Abstract (experiments paragraph): the central claim that 'SceneParser achieves stronger structure-aware performance' rests on comparisons whose quantitative values, baseline implementations, error bars, and ablation tables are not supplied in the abstract and must be verified against the full experimental section to confirm they are not driven by the shared data-generation process.

Authors: The abstract provides only a high-level summary of the experimental outcomes. All quantitative results, baseline details, error bars, ablation studies, COCO/AGD20K evaluations, and the planning probe are reported in the full experimental section. The manuscript already contains the supporting tables and analysis needed to substantiate the claims. No changes to the abstract are necessary, though we can add a cross-reference if desired. revision_made = 'no'. revision: no

standing simulated objections not resolved

Absence of independent human ground-truth annotations for cross-level bindings and affordance chains in SceneParser-Bench; addressing this would require a separate large-scale human annotation campaign outside the current scope.

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on synthetic benchmark with external dataset support

full rationale

The paper describes an empirical VLM-based model trained with structural-completion pseudo labels from a hierarchical data engine and evaluated on SceneParser-Bench constructed from the same engine, plus ablations and results on external datasets COCO and AGD20K. No equations, fitted parameters, derivations, or self-citation chains are present in the abstract or described claims. Performance is presented as direct comparison rather than any reduction of outputs to inputs by construction. This is a standard empirical ML setup with no load-bearing circular steps matching the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, derivations, or modeling assumptions beyond the high-level description of pseudo-label generation and curriculum learning; ledger entries cannot be populated without full text.

pith-pipeline@v0.9.1-grok · 5808 in / 1161 out tokens · 23826 ms · 2026-06-30T20:50:04.064132+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

80 extracted references · 26 canonical work pages · 11 internal anchors

[1]

The pascal visual object classes (voc) challenge.IJCV, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.IJCV, 2010

2010
[2]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

2014
[3]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR), 2017

2017
[4]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.URL https://arxiv. org/abs/2307.15818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[6]

Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[7]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision(ECCV), 2024

2024
[8]

Grounding dino 1.5: Advance the edge of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xi- aoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the edge of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

work page arXiv 2024
[9]

Yolo-world: Real-time open- vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open- vocabulary object detection. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recogni- tion(CVPR), 2024

2024
[10]

Florence-2: Advancing a unified representation for a variety of vision tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[11]

Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models

Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

2025
[12]

Towards visual grounding: A survey

Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, and Changsheng Xu. Towards visual grounding: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI), 2025

2025
[13]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2022

2022
[14]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023. 11 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Ferret-v2: An improved baseline for referring and grounding with large language models.arXiv preprint arXiv:2404.07973, 2024

Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models.arXiv preprint arXiv:2404.07973, 2024

work page arXiv 2024
[16]

Groma: Localized visual tokenization for grounding multimodal large language models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. InEuropean Conference on Computer Vision(ECCV), 2024

2024
[17]

Chatrex: Taming multimodal llm for joint perception and understanding.arXiv preprint arXiv:2411.18363, 2024

Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, and Lei Zhang. Chatrex: Taming multimodal llm for joint perception and understanding.arXiv preprint arXiv:2411.18363, 2024

work page arXiv 2024
[18]

Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025

Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025

work page arXiv 2025
[19]

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Partimagenet: A large, high-quality dataset of parts

Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. InEuropean Conference on Computer Vision(ECCV), 2022

2022
[21]

Paco: Parts and attributes of common objects

Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2023

2023
[22]

Learning affordance grounding from exocentric images

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2022

2022
[23]

Affordancellm: Grounding affordance from vision language models

Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. Affordancellm: Grounding affordance from vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[24]

Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

work page arXiv 2024
[25]

Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence(AAAI), 2026

2026
[26]

Digital gene: Learning about the physical world through analytic concepts.arXiv preprint arXiv:2504.04170, 2025

Jianhua Sun and Cewu Lu. Digital gene: Learning about the physical world through analytic concepts.arXiv preprint arXiv:2504.04170, 2025

work page arXiv 2025
[27]

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

work page arXiv 2024
[28]

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. Unleashing spatial reasoning in multimodal large language models via textual representation guided reasoning.arXiv preprint arXiv:2603.23404, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Thinking with visual primitives

Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, and Gang Zeng. Thinking with visual primitives. https://github.com/deepseek-ai/Thinking-with-Visual-Primitives , 2026. Technical report

2026
[30]

You only look once: Unified, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR), 2016. 12 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

2016
[31]

Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems, 2015

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems, 2015

2015
[32]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision(ECCV), 2020

2020
[33]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, et al. Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

work page arXiv 2024
[35]

Yoloe: Real-time seeing anything

Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2025

2025
[36]

T-rex2: Towards generic object detection via text-visual prompt synergy

Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-rex2: Towards generic object detection via text-visual prompt synergy. InEuropean Conference on Computer Vision(ECCV), 2024

2024
[37]

Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems(NeuIPS), 2023

Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems(NeuIPS), 2023

2023
[38]

Language-conditioned detection transformer

Jang Hyun Cho and Philipp Krähenbühl. Language-conditioned detection transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[39]

arXiv preprint arXiv:2109.10852 , year=

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection.arXiv preprint arXiv:2109.10852, 2021

work page arXiv 2021
[40]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[43]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024
[44]

Detect what you can: Detecting and representing objects using holistic models and body parts

Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014

2014
[45]

Going denser with open- vocabulary part segmentation

Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open- vocabulary part segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2023

2023
[46]

Understanding multi-granularity for open- vocabulary part segmentation.Advances in Neural Information Processing Systems, 2024

Jiho Choi, Seonho Lee, Seungho Lee, Minhyun Lee, and Hyunjung Shim. Understanding multi-granularity for open- vocabulary part segmentation.Advances in Neural Information Processing Systems, 2024

2024
[47]

Partglee: A foundation model for recognizing and parsing any objects

Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, and Xiang Bai. Partglee: A foundation model for recognizing and parsing any objects. InEuropean Conference on Computer Vision(ECCV), 2024. 13 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

2024
[48]

Instructpart: Task-oriented part segmentation with instruction reasoning

Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, and Katia P Sycara. Instructpart: Task-oriented part segmentation with instruction reasoning. InProceedings of the Meeting of the Association for Computational Linguistics, 2025

2025
[49]

Affordance detection of tool parts from geometric features

Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InIEEE international conference on robotics and automation(ICRA), 2015

2015
[50]

Object-based affordances detection with convolutional neural networks and dense conditional random fields

Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. InIEEE/RSJ International Conference on Intelligent Robots and Systems(IROS), 2017

2017
[51]

Maskprompt: open-vocabulary affordance segmentation with object shape mask prompts

Dongpan Chen, Dehui Kong, Jinghua Li, and Baocai Yin. Maskprompt: open-vocabulary affordance segmentation with object shape mask prompts. InProceedings of the AAAI Conference on Artificial Intelligence(AAAI), 2025

2025
[52]

Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter

Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS), 2025

2025
[53]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

work page arXiv 2025
[54]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2025

2025
[55]

A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, and Ying-Cong Chen. A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

work page arXiv 2025
[56]

Glover++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation.arXiv preprint arXiv:2505.11865, 2025

Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, and Junwei Liang. Glover++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation.arXiv preprint arXiv:2505.11865, 2025

work page arXiv 2025
[57]

3d-affordancellm: Har- nessing large language models for open-vocabulary affordance detection in 3d worlds.arXiv preprint arXiv:2502.20041, 2025

Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3d-affordancellm: Har- nessing large language models for open-vocabulary affordance detection in 3d worlds.arXiv preprint arXiv:2502.20041, 2025

work page arXiv 2025
[58]

Learning precise affordances from egocentric videos for robotic manipulation

Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, and Laura Sevilla- Lara. Learning precise affordances from egocentric videos for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2025

2025
[59]

Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale.arXiv preprint arXiv:2506.12009, 2025

Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, and Minsu Cho. Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale.arXiv preprint arXiv:2506.12009, 2025

work page arXiv 2025
[60]

Gpt-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

OpenAI. Gpt-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

2025
[61]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Egoobjects: A large-scale egocentric dataset for fine-grained object understanding

Chenchen Zhu, Fanyi Xiao, Andrés Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Culatana, Roshan Sumbaly, and Zhicheng Yan. Egoobjects: A large-scale egocentric dataset for fine-grained object understanding. In Proceedings of the IEEE/CVF international conference on computer vision(ICCV), 2023

2023
[63]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the International Conference on Machine Learning, 2009

2009
[64]

Self-paced learning for latent variable models.Advances in Neural Information Processing Systems(NeuIPS), 2010

M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models.Advances in Neural Information Processing Systems(NeuIPS), 2010

2010
[65]

Competence-based curriculum learning for neural machine translation

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. InProceedings of the conference of the North American chapter of the association for computational linguistics, 2019

2019
[66]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems(NeuIPS), 2023

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems(NeuIPS), 2023

2023
[67]

Curriculum learning with quality-driven data selection.arXiv preprint arXiv:2407.00102, 2024

Biao Wu and Ling Chen. Curriculum learning with quality-driven data selection.arXiv preprint arXiv:2407.00102, 2024

work page arXiv 2024
[68]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

2026
[69]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, 2026

2026
[70]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[71]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Seed2.0 model card: Towards intelligence frontier for real-world complexity

ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity. ByteDance Seed technical report, 2026

2026
[73]

Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 15 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding A Formal Definition of Evaluation Metrics Overview.We provide the formal definition of ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

door”: “white door

Scene - level Object Grounding 1. Scene - level Object Description (GPT5) Scene - level object noun and referring expressions • “door”: “white door” • “ cookto p ” : “black cooktop” • “ socket ” : “white socket” • “ counte r ” : “kitchen counter with drawer” 2 . Object Grounding& Merging 2 . Object - level Part Parsing Object Bounding Box 3. Object Croppi...
[75]

whisk handl e

Affordance Parsing 7. Affordance Region D escription (GPT5) • “ whisk handl e ” • “ door handl e ” • “ rim of the bowl ” • “ s ocket button ” 8 . Affordance Segmentation (SAM3) Hierarchical Scene Scene | -- Object: Socket | | -- Bbox | | -- Part: Button | | | -- Bbox | | | -- Affordance | | | -- Action | | | -- Point | | -- Part: Button | | -- Bbox | | --...
[76]

Parse all visible interactive objects
[77]

If no valid object, return {"objects":[]}
[78]

Always include parts and affordances arrays (empty if none)
[79]

Do not output extra keys or explanations
[80]

string",

No placeholder literals like "string", "object", "part", "action". 19 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding F Qualitative Visualization GT obj:glaves obj:stove part:stovetop part:stovetop part:handel obj:basket obj:bag obj:bottle obj:drawer obj:cupboard obj:drawer obj:spatula obj:rag obj:bowl obj:ladle obj:mixer part:h...

[1] [1]

The pascal visual object classes (voc) challenge.IJCV, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.IJCV, 2010

2010

[2] [2]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

2014

[3] [3]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR), 2017

2017

[4] [4]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.URL https://arxiv. org/abs/2307.15818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[6] [6]

Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[7] [7]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision(ECCV), 2024

2024

[8] [8]

Grounding dino 1.5: Advance the edge of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xi- aoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the edge of open-set object detection.arXiv preprint arXiv:2405.10300, 2024

work page arXiv 2024

[9] [9]

Yolo-world: Real-time open- vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open- vocabulary object detection. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recogni- tion(CVPR), 2024

2024

[10] [10]

Florence-2: Advancing a unified representation for a variety of vision tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[11] [11]

Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models

Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025

2025

[12] [12]

Towards visual grounding: A survey

Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, and Changsheng Xu. Towards visual grounding: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI), 2025

2025

[13] [13]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2022

2022

[14] [14]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023. 11 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Ferret-v2: An improved baseline for referring and grounding with large language models.arXiv preprint arXiv:2404.07973, 2024

Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models.arXiv preprint arXiv:2404.07973, 2024

work page arXiv 2024

[16] [16]

Groma: Localized visual tokenization for grounding multimodal large language models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. InEuropean Conference on Computer Vision(ECCV), 2024

2024

[17] [17]

Chatrex: Taming multimodal llm for joint perception and understanding.arXiv preprint arXiv:2411.18363, 2024

Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, and Lei Zhang. Chatrex: Taming multimodal llm for joint perception and understanding.arXiv preprint arXiv:2411.18363, 2024

work page arXiv 2024

[18] [18]

Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025

Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction.arXiv preprint arXiv:2510.12798, 2025

work page arXiv 2025

[19] [19]

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Partimagenet: A large, high-quality dataset of parts

Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. Partimagenet: A large, high-quality dataset of parts. InEuropean Conference on Computer Vision(ECCV), 2022

2022

[21] [21]

Paco: Parts and attributes of common objects

Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2023

2023

[22] [22]

Learning affordance grounding from exocentric images

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2022

2022

[23] [23]

Affordancellm: Grounding affordance from vision language models

Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. Affordancellm: Grounding affordance from vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[24] [24]

Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

work page arXiv 2024

[25] [25]

Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence(AAAI), 2026

2026

[26] [26]

Digital gene: Learning about the physical world through analytic concepts.arXiv preprint arXiv:2504.04170, 2025

Jianhua Sun and Cewu Lu. Digital gene: Learning about the physical world through analytic concepts.arXiv preprint arXiv:2504.04170, 2025

work page arXiv 2025

[27] [27]

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

work page arXiv 2024

[28] [28]

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, and Miao Liu. Unleashing spatial reasoning in multimodal large language models via textual representation guided reasoning.arXiv preprint arXiv:2603.23404, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Thinking with visual primitives

Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, and Gang Zeng. Thinking with visual primitives. https://github.com/deepseek-ai/Thinking-with-Visual-Primitives , 2026. Technical report

2026

[30] [30]

You only look once: Unified, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition(CVPR), 2016. 12 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

2016

[31] [31]

Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems, 2015

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems, 2015

2015

[32] [32]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision(ECCV), 2020

2020

[33] [33]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[34] [34]

Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, et al. Dino-x: A unified vision model for open-world object detection and understanding.arXiv preprint arXiv:2411.14347, 2024

work page arXiv 2024

[35] [35]

Yoloe: Real-time seeing anything

Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2025

2025

[36] [36]

T-rex2: Towards generic object detection via text-visual prompt synergy

Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-rex2: Towards generic object detection via text-visual prompt synergy. InEuropean Conference on Computer Vision(ECCV), 2024

2024

[37] [37]

Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems(NeuIPS), 2023

Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection.Advances in Neural Information Processing Systems(NeuIPS), 2023

2023

[38] [38]

Language-conditioned detection transformer

Jang Hyun Cho and Philipp Krähenbühl. Language-conditioned detection transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[39] [39]

arXiv preprint arXiv:2109.10852 , year=

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection.arXiv preprint arXiv:2109.10852, 2021

work page arXiv 2021

[40] [40]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[43] [43]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition(CVPR), 2024

2024

[44] [44]

Detect what you can: Detecting and representing objects using holistic models and body parts

Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014

2014

[45] [45]

Going denser with open- vocabulary part segmentation

Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open- vocabulary part segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2023

2023

[46] [46]

Understanding multi-granularity for open- vocabulary part segmentation.Advances in Neural Information Processing Systems, 2024

Jiho Choi, Seonho Lee, Seungho Lee, Minhyun Lee, and Hyunjung Shim. Understanding multi-granularity for open- vocabulary part segmentation.Advances in Neural Information Processing Systems, 2024

2024

[47] [47]

Partglee: A foundation model for recognizing and parsing any objects

Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, and Xiang Bai. Partglee: A foundation model for recognizing and parsing any objects. InEuropean Conference on Computer Vision(ECCV), 2024. 13 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

2024

[48] [48]

Instructpart: Task-oriented part segmentation with instruction reasoning

Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, and Katia P Sycara. Instructpart: Task-oriented part segmentation with instruction reasoning. InProceedings of the Meeting of the Association for Computational Linguistics, 2025

2025

[49] [49]

Affordance detection of tool parts from geometric features

Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InIEEE international conference on robotics and automation(ICRA), 2015

2015

[50] [50]

Object-based affordances detection with convolutional neural networks and dense conditional random fields

Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. InIEEE/RSJ International Conference on Intelligent Robots and Systems(IROS), 2017

2017

[51] [51]

Maskprompt: open-vocabulary affordance segmentation with object shape mask prompts

Dongpan Chen, Dehui Kong, Jinghua Li, and Baocai Yin. Maskprompt: open-vocabulary affordance segmentation with object shape mask prompts. InProceedings of the AAAI Conference on Artificial Intelligence(AAAI), 2025

2025

[52] [52]

Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter

Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS), 2025

2025

[53] [53]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

work page arXiv 2025

[54] [54]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2025

2025

[55] [55]

A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, and Ying-Cong Chen. A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

work page arXiv 2025

[56] [56]

Glover++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation.arXiv preprint arXiv:2505.11865, 2025

Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, and Junwei Liang. Glover++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation.arXiv preprint arXiv:2505.11865, 2025

work page arXiv 2025

[57] [57]

3d-affordancellm: Har- nessing large language models for open-vocabulary affordance detection in 3d worlds.arXiv preprint arXiv:2502.20041, 2025

Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3d-affordancellm: Har- nessing large language models for open-vocabulary affordance detection in 3d worlds.arXiv preprint arXiv:2502.20041, 2025

work page arXiv 2025

[58] [58]

Learning precise affordances from egocentric videos for robotic manipulation

Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, and Laura Sevilla- Lara. Learning precise affordances from egocentric videos for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), 2025

2025

[59] [59]

Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale.arXiv preprint arXiv:2506.12009, 2025

Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, and Minsu Cho. Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale.arXiv preprint arXiv:2506.12009, 2025

work page arXiv 2025

[60] [60]

Gpt-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

OpenAI. Gpt-5 system card.https://openai.com/index/gpt-5-system-card/, 2025

2025

[61] [61]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [62]

Egoobjects: A large-scale egocentric dataset for fine-grained object understanding

Chenchen Zhu, Fanyi Xiao, Andrés Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Culatana, Roshan Sumbaly, and Zhicheng Yan. Egoobjects: A large-scale egocentric dataset for fine-grained object understanding. In Proceedings of the IEEE/CVF international conference on computer vision(ICCV), 2023

2023

[63] [63]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the International Conference on Machine Learning, 2009

2009

[64] [64]

Self-paced learning for latent variable models.Advances in Neural Information Processing Systems(NeuIPS), 2010

M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models.Advances in Neural Information Processing Systems(NeuIPS), 2010

2010

[65] [65]

Competence-based curriculum learning for neural machine translation

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. InProceedings of the conference of the North American chapter of the association for computational linguistics, 2019

2019

[66] [66]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems(NeuIPS), 2023

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems(NeuIPS), 2023

2023

[67] [67]

Curriculum learning with quality-driven data selection.arXiv preprint arXiv:2407.00102, 2024

Biao Wu and Ling Chen. Curriculum learning with quality-driven data selection.arXiv preprint arXiv:2407.00102, 2024

work page arXiv 2024

[68] [68]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

2026

[69] [69]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, 2026

2026

[70] [70]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[71] [71]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[72] [72]

Seed2.0 model card: Towards intelligence frontier for real-world complexity

ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity. ByteDance Seed technical report, 2026

2026

[73] [73]

Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 15 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding A Formal Definition of Evaluation Metrics Overview.We provide the formal definition of ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

door”: “white door

Scene - level Object Grounding 1. Scene - level Object Description (GPT5) Scene - level object noun and referring expressions • “door”: “white door” • “ cookto p ” : “black cooktop” • “ socket ” : “white socket” • “ counte r ” : “kitchen counter with drawer” 2 . Object Grounding& Merging 2 . Object - level Part Parsing Object Bounding Box 3. Object Croppi...

[75] [75]

whisk handl e

Affordance Parsing 7. Affordance Region D escription (GPT5) • “ whisk handl e ” • “ door handl e ” • “ rim of the bowl ” • “ s ocket button ” 8 . Affordance Segmentation (SAM3) Hierarchical Scene Scene | -- Object: Socket | | -- Bbox | | -- Part: Button | | | -- Bbox | | | -- Affordance | | | -- Action | | | -- Point | | -- Part: Button | | -- Bbox | | --...

[76] [76]

Parse all visible interactive objects

[77] [77]

If no valid object, return {"objects":[]}

[78] [78]

Always include parts and affordances arrays (empty if none)

[79] [79]

Do not output extra keys or explanations

[80] [80]

string",

No placeholder literals like "string", "object", "part", "action". 19 SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding F Qualitative Visualization GT obj:glaves obj:stove part:stovetop part:stovetop part:handel obj:basket obj:bag obj:bottle obj:drawer obj:cupboard obj:drawer obj:spatula obj:rag obj:bowl obj:ladle obj:mixer part:h...