COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition
Pith reviewed 2026-05-22 07:52 UTC · model grok-4.3
The pith
A large benchmark enables open tree decomposition of images into flexible hierarchical structures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Leveraging an automated pipeline that combines LVLMs for semantic reasoning and SAM 3 for geometric grounding, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes in an open-vocabulary space of over 3.5K unique labels. We establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency.
What carries the argument
The fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models with the precise geometric grounding of SAM 3 to produce hierarchical tree annotations at scale.
If this is right
- Models for hierarchical visual parsing can now be trained and compared at scale without predefined category limits.
- Research on complex physical assemblies can draw on the long-tail label distribution captured in the dataset.
- The OTQ metric supplies a consistent protocol for measuring both geometric accuracy and structural coherence in tree decompositions.
- Progress on scene understanding tasks that require flexible part-whole relations becomes directly testable against this benchmark.
Where Pith is reading between the lines
- The same automated pipeline approach could be adapted to generate hierarchical annotations for video or 3D scene data.
- Downstream applications such as robotic manipulation or augmented reality may benefit from using these open trees as intermediate representations.
- Analysis of the dataset could reveal statistical patterns in natural visual hierarchies that align with human perception across domains.
Load-bearing premise
The automated pipeline using LVLMs for semantic reasoning and SAM 3 for geometric grounding produces annotations that reliably match human structural judgment.
What would settle it
A large independent human study on a random subset of the COCOTree images that reveals substantial mismatches in chosen hierarchies, node boundaries, or component labels compared with the automated annotations.
Figures
read the original abstract
We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes the task of open tree-structured visual decomposition, which involves segmenting images into hierarchical trees of visual components with unconstrained granularity. It introduces an automated pipeline that combines Large Vision-Language Models (LVLMs) for semantic reasoning with SAM 3 for geometric grounding to generate annotations. Leveraging this pipeline, the authors construct the COCOTree dataset containing over 21K images and 1.8M structural nodes across an open vocabulary of 3.5K unique labels. They claim that rigorous human evaluation confirms strong alignment between the generated annotations and human structural judgment. Finally, they propose the Open Tree Quality (OTQ) metric to jointly evaluate mask precision, label accuracy, and structural consistency, and release the dataset along with benchmark code.
Significance. If the central claim of high-quality, human-aligned annotations holds, this work would provide a valuable large-scale foundation benchmark for a new paradigm in visual understanding that emphasizes hierarchical, open-vocabulary tree decompositions. The automated pipeline addresses annotation scalability challenges, and the OTQ metric offers a standardized evaluation protocol that could facilitate future research. The dataset's scale and coverage of long-tail distributions represent practical strengths for training and benchmarking models in complex scene decomposition tasks.
major comments (2)
- [Human evaluation section] Human evaluation section: The manuscript asserts that 'rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment,' yet provides no quantitative details on (a) the number of images or nodes sampled for review, (b) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement), or (c) whether the sampled subset is representative of the full 21K-image distribution and the 3.5K-label long tail. This validation is load-bearing for the claim that COCOTree can serve as reliable ground truth for model training and OTQ benchmarking.
- [Dataset construction pipeline (Section 3)] Dataset construction pipeline (Section 3): The description of how LVLM semantic outputs are fused with SAM 3 geometric grounding lacks specifics on error handling, conflict resolution, or failure modes for complex assemblies. Without these details or an error analysis across the 1.8M nodes, it is difficult to evaluate the reliability of the automated annotations at scale.
minor comments (3)
- [Abstract] The abstract would benefit from a concise statement of the OTQ metric's formulation or key components to better highlight the evaluation contribution.
- [Introduction and preliminaries] Notation for tree nodes and hierarchy levels should be defined more explicitly in the early sections to improve readability for readers unfamiliar with tree-structured decomposition.
- [Figures] Figure captions could more clearly indicate which elements represent semantic labels versus geometric masks in the example tree visualizations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to improve clarity and completeness.
read point-by-point responses
-
Referee: [Human evaluation section] Human evaluation section: The manuscript asserts that 'rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment,' yet provides no quantitative details on (a) the number of images or nodes sampled for review, (b) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement), or (c) whether the sampled subset is representative of the full 21K-image distribution and the 3.5K-label long tail. This validation is load-bearing for the claim that COCOTree can serve as reliable ground truth for model training and OTQ benchmarking.
Authors: We agree that additional quantitative details would strengthen the human evaluation section. In the revised manuscript, we will expand the section to report the number of images and nodes sampled for review, inter-annotator agreement statistics (including percentage agreement and Cohen's kappa), and an explanation of how the sampled subset was selected to ensure representativeness across the 21K-image distribution and the long-tail labels. This will provide stronger support for the alignment claim. revision: yes
-
Referee: [Dataset construction pipeline (Section 3)] Dataset construction pipeline (Section 3): The description of how LVLM semantic outputs are fused with SAM 3 geometric grounding lacks specifics on error handling, conflict resolution, or failure modes for complex assemblies. Without these details or an error analysis across the 1.8M nodes, it is difficult to evaluate the reliability of the automated annotations at scale.
Authors: We acknowledge that more specifics on the fusion process would improve the pipeline description. We will revise Section 3 to detail error handling and conflict resolution mechanisms between LVLM semantic outputs and SAM 3 geometric grounding, along with discussion of failure modes for complex assemblies. We will also add an error analysis subsection with sampled statistics across the nodes to better demonstrate reliability at scale. revision: yes
Circularity Check
No circularity in dataset construction or metric proposal
full rationale
The paper introduces a new task of open tree decomposition and contributes an automated pipeline (LVLMs + SAM 3) to generate the COCOTree dataset, followed by human evaluation to validate alignment with human judgment and the OTQ metric for evaluation. No equations, derivations, fitted parameters, or self-citations are present in the provided text that would reduce any claim to its own inputs by construction. The human evaluation is described as an external confirmation step, not a self-referential loop, making the contribution self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large vision-language models can reliably identify and label visual components in a hierarchical manner without human supervision.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We formalize ... open tree decomposition ... COCOTREE ... 21K images and 1.8M structural nodes ... Open Tree Quality (OTQ) metric
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with ... SAM 3
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Coco-stuff: Thing and stuff classes in context
Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
work page 2018
-
[2]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
Detect what you can: Detecting and representing objects using holistic models and body parts
Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014
work page 1971
-
[4]
Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmenta- tion.arXiv preprint arXiv:2409.10542, 2024
-
[5]
Part- aware panoptic segmentation
Daan de Geus, Panagiotis Meletis, Chenyang Lu, Xiaoxiao Wen, and Gijs Dubbelman. Part- aware panoptic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5490, 2021
work page 2021
-
[6]
COCONut: Modernizing coco segmentation
Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, and Liang-Chieh Chen. COCONut: Modernizing coco segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21863–21873, 2024
work page 2024
-
[7]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019
work page 2019
-
[8]
LVIS: A dataset for large vocabulary instance segmentation
Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019
work page 2019
-
[9]
PartImageNet: A large, high-quality dataset of parts
Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. InComputer Vision – ECCV 2022, pages 128–145. Springer, 2022
work page 2022
-
[10]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. InProceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017
work page 2017
-
[11]
Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019
work page 2019
-
[12]
Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023
work page 2023
-
[13]
LISA: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[14]
Semantic-sam: Segment and recognize anything at any granu- larity
Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023. 10
-
[15]
Deep hierarchical semantic segmentation
Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1236–1247, 2022
work page 2022
-
[16]
Panoptic-partformer: Learning a unified model for panoptic part segmentation
Xiangtai Li, Shilin Xu, Jinheng Yang, Guangliang Cheng, Yunhai Tong, and Dacheng Tao. Panoptic-partformer: Learning a unified model for panoptic part segmentation. InComputer Vision – ECCV 2022. Springer, 2022
work page 2022
-
[17]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision – ECCV 2014, pages 740–755. Springer, 2014
work page 2014
-
[18]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URLhttps://arxiv.org/abs/2304.08485
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Fully convolutional networks for seman- tic segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for seman- tic segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015
work page 2015
-
[20]
McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond
John P. McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond. English WordNet 2020: Improving and extending a WordNet for english using an open-source methodology. In Proceedings of the LREC 2020 Workshop on Multimodal Wordnets, pages 14–19, Marseille, France, 2020. European Language Resources Association
work page 2020
-
[21]
Panagiotis Meletis, Xiaoxiao Wen, Chenyang Lu, Daan de Geus, and Gijs Dubbelman. Cityscapes-panoptic-parts and PASCAL-panoptic-parts datasets for scene understanding.arXiv preprint arXiv:2004.07944, 2020
-
[22]
George A. Miller. WordNet: A lexical database for english.Communications of the ACM, 38 (11):39–41, 1995
work page 1995
-
[23]
Hierarchical semantic segmentation with autoregressive language modeling
Josh Myers-Dean, Brian Price, Yifei Fan, and Danna Gurari. Hierarchical semantic segmentation with autoregressive language modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4120–4130, 2025
work page 2025
-
[24]
Spin: Hierarchical segmentation with subpart granularity in natural images
Josh Myers-Dean, Jarek Reynolds, Brian Price, Yifei Fan, and Danna Gurari. Spin: Hierarchical segmentation with subpart granularity in natural images. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 275–292, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72691-0
work page 2024
-
[25]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceed- ings of the 38th International Conference on Machine Learning, volume 139 ofProceedings...
work page 2021
-
[26]
Paco: Parts and attributes of common objects
Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. Paco: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023
work page 2023
-
[27]
Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13009–13018, June 2024
work page 2024
-
[28]
Sentence-BERT: Sentence embeddings using siamese BERT- networks
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019. 11
work page 2019
-
[29]
PixelLM: Pixel reasoning with large multimodal model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[30]
Benchmarking object detectors with coco: A new path forward
Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with coco: A new path forward. InEuropean Conference on Computer Vision (ECCV). Springer, 2024
work page 2024
-
[31]
Chufeng Tang, Lingxi Xie, Xiaopeng Zhang, Xiaolin Hu, and Qi Tian. Visual recognition by request. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15265–15274, 2023
work page 2023
-
[32]
Llm-seg: Bridging image segmentation and large language model reasoning, 2024
Junchi Wang and Lei Ke. Llm-seg: Bridging image segmentation and large language model reasoning, 2024. URLhttps://arxiv.org/abs/2404.08767
-
[33]
HIPIE: Hierarchical open-vocabulary universal image segmentation
Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. HIPIE: Hierarchical open-vocabulary universal image segmentation. InAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[34]
SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024
XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024
-
[35]
Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled LVLM-SAM framework for reasoning segmentation in optical remote sensing.arXiv preprint arXiv:2512.19302, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset.International Journal of Computer Vision, 127(3):302–321, 2019. 12 A Prompt Details This appendix summarizes the prompts used by the LVLM planner in our construction pipeline. The prompts are design...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.