COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

Hyosu Kim; Junhyub Lee; Seunghun Chae

arxiv: 2605.22068 · v1 · pith:GCDKBCQRnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 07:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords open tree decomposition · hierarchical image segmentation · benchmark dataset · open vocabulary · tree-structured visual parsing · large vision-language models · structural consistency metric

The pith

A large benchmark enables open tree decomposition of images into flexible hierarchical structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces open tree decomposition as the task of segmenting an image into hierarchical trees of visual components without fixed granularity or closed vocabularies. It overcomes manual annotation costs with an automated pipeline that pairs large vision-language models for semantic reasoning and SAM 3 for geometric grounding. From this pipeline the authors build COCOTree, a dataset of over 21,000 images containing 1.8 million structural nodes across more than 3,500 unique labels that capture long-tail physical assemblies. They also define the Open Tree Quality metric to measure mask precision, label accuracy, and structural consistency together. Human evaluation shows the generated trees align closely with human structural judgments.

Core claim

We formalize the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Leveraging an automated pipeline that combines LVLMs for semantic reasoning and SAM 3 for geometric grounding, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes in an open-vocabulary space of over 3.5K unique labels. We establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency.

What carries the argument

The fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models with the precise geometric grounding of SAM 3 to produce hierarchical tree annotations at scale.

If this is right

Models for hierarchical visual parsing can now be trained and compared at scale without predefined category limits.
Research on complex physical assemblies can draw on the long-tail label distribution captured in the dataset.
The OTQ metric supplies a consistent protocol for measuring both geometric accuracy and structural coherence in tree decompositions.
Progress on scene understanding tasks that require flexible part-whole relations becomes directly testable against this benchmark.
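
To make the three ingredients of OTQ concrete, here is a minimal Python sketch of a per-node score. It is an illustration only: this page does not reproduce the paper's OTQ formula, so the names `mask_iou` and `toy_node_score`, the dictionary layout, the exact-match label test, and the decision to multiply the three terms are all assumptions rather than the authors' definition.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def toy_node_score(pred, gt):
    """Hypothetical composite: mask overlap x label match x parent match.

    NOT the paper's OTQ; it only shows how geometric, semantic, and
    structural terms could be combined into one number.
    """
    iou = mask_iou(pred["mask"], gt["mask"])
    label_ok = 1.0 if pred["label"] == gt["label"] else 0.0
    parent_ok = 1.0 if pred["parent"] == gt["parent"] else 0.0
    return iou * label_ok * parent_ok


# Toy example: the predicted wheel mask misses one of four ground-truth pixels.
gt_node = {"mask": {(0, 0), (0, 1), (1, 0), (1, 1)}, "label": "wheel", "parent": "bicycle"}
pred_node = {"mask": {(0, 0), (0, 1), (1, 0)}, "label": "wheel", "parent": "bicycle"}
wrong_parent = dict(pred_node, parent="car")

score = toy_node_score(pred_node, gt_node)
score_wrong_parent = toy_node_score(wrong_parent, gt_node)
```

In this toy setup a node with a good mask but the wrong parent scores zero, which is one (strict) way to read "structural consistency". A real metric would likely use softer label matching for an open vocabulary.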

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same automated pipeline approach could be adapted to generate hierarchical annotations for video or 3D scene data.
Downstream applications such as robotic manipulation or augmented reality may benefit from using these open trees as intermediate representations.
Analysis of the dataset could reveal statistical patterns in natural visual hierarchies that align with human perception across domains.

Load-bearing premise

The automated pipeline using LVLMs for semantic reasoning and SAM 3 for geometric grounding produces annotations that reliably match human structural judgment.

What would settle it

A large independent human study on a random subset of the COCOTree images that reveals substantial mismatches in chosen hierarchies, node boundaries, or component labels compared with the automated annotations.
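
One way such a study could draw its subset without ignoring rare labels is stratified sampling. The sketch below is a hypothetical illustration, not a procedure described in the paper: the function name `stratified_sample`, the `(label, node_id)` tuple layout, and the per-label quota are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from every stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the audit subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for name in sorted(strata):  # sorted for a deterministic stratum order
        members = strata[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Toy long-tail data: one frequent label and one label that occurs once.
nodes = [("wheel", i) for i in range(50)] + [("derailleur", 0)]
audit_set = stratified_sample(nodes, key=lambda n: n[0], per_stratum=2)
```

A purely uniform sample of the same size would almost always miss the single `derailleur` node, which is why long-tail coverage is worth checking explicitly.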

Figures

Figures reproduced from arXiv: 2605.22068 by Hyosu Kim, Junhyub Lee, Seunghun Chae.

**Figure 1.** Overview of COCOTREE. Left: COCOTREE provides dense open-tree annotations over COCO images. Middle: each image is decomposed into visible components grounded by instance masks. Right: the same annotation can be viewed as a semantic-node tree, which groups repeated masks under a shared local label, and as an instance-node tree, where each mask becomes a node with an explicit visual parent.

**Figure 2.** Fully automated open tree construction pipeline.

**Figure 3.** Review website interface used for human evaluation. Reviewers used this interface to …

**Figure 4.** Label distribution treemap for COCOTREE.

**Figure 5.** Joint distribution of masks and labels per image in …

**Figure 6.** Spatial distribution of annotation centers in the reviewed samples. The plot summarizes …
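
The two views named in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `InstanceNode` class, its field names, and the grouping rule (same parent plus same label) are hypothetical and are not taken from the released COCOTree code or file format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceNode:
    """One mask in the instance-node view (hypothetical schema)."""
    node_id: int
    label: str
    parent_id: Optional[int]  # None marks a root node


def to_semantic_groups(nodes):
    """Group instances that share a parent and a label into one semantic node."""
    groups = defaultdict(list)
    for node in nodes:
        groups[(node.parent_id, node.label)].append(node.node_id)
    return dict(groups)


# Toy instance-node tree: a bicycle with two wheels and a saddle.
instance_tree = [
    InstanceNode(0, "bicycle", None),
    InstanceNode(1, "wheel", 0),
    InstanceNode(2, "wheel", 0),
    InstanceNode(3, "saddle", 0),
]
semantic_groups = to_semantic_groups(instance_tree)
```

Here the two wheel masks collapse into a single `(0, "wheel")` group, mirroring the caption's "repeated masks under a shared local label", while the instance view keeps them as separate nodes with an explicit parent.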

Original abstract

We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

COCOTree supplies the first large-scale benchmark for open-vocabulary hierarchical tree decomposition, but its annotation quality rests on human checks whose scale and rigor are not yet detailed enough to fully support the claims.

The paper introduces a new task called open tree decomposition and backs it with COCOTree, a dataset of over 21K images containing 1.8M nodes across more than 3.5K labels. The automated pipeline pairs LVLMs for semantic reasoning with SAM 3 for mask grounding. The authors then propose the OTQ metric, which scores mask precision, label accuracy, and structural consistency jointly, and they release the data and code.

Referee Report

2 major / 3 minor

Summary. The paper formalizes the task of open tree-structured visual decomposition, which involves segmenting images into hierarchical trees of visual components with unconstrained granularity. It introduces an automated pipeline that combines Large Vision-Language Models (LVLMs) for semantic reasoning with SAM 3 for geometric grounding to generate annotations. Leveraging this pipeline, the authors construct the COCOTree dataset containing over 21K images and 1.8M structural nodes across an open vocabulary of 3.5K unique labels. They claim that rigorous human evaluation confirms strong alignment between the generated annotations and human structural judgment. Finally, they propose the Open Tree Quality (OTQ) metric to jointly evaluate mask precision, label accuracy, and structural consistency, and release the dataset along with benchmark code.

Significance. If the central claim of high-quality, human-aligned annotations holds, this work would provide a valuable large-scale foundation benchmark for a new paradigm in visual understanding that emphasizes hierarchical, open-vocabulary tree decompositions. The automated pipeline addresses annotation scalability challenges, and the OTQ metric offers a standardized evaluation protocol that could facilitate future research. The dataset's scale and coverage of long-tail distributions represent practical strengths for training and benchmarking models in complex scene decomposition tasks.

major comments (2)

[Human evaluation section] The manuscript asserts that 'rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment,' yet provides no quantitative details on (a) the number of images or nodes sampled for review, (b) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement), or (c) whether the sampled subset is representative of the full 21K-image distribution and the 3.5K-label long tail. This validation is load-bearing for the claim that COCOTree can serve as reliable ground truth for model training and OTQ benchmarking.
[Dataset construction pipeline (Section 3)] The description of how LVLM semantic outputs are fused with SAM 3 geometric grounding lacks specifics on error handling, conflict resolution, or failure modes for complex assemblies. Without these details or an error analysis across the 1.8M nodes, it is difficult to evaluate the reliability of the automated annotations at scale.
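
For readers unfamiliar with the agreement statistics requested in the first comment, the following Python sketch computes percentage agreement and Cohen's kappa for two reviewers. The reviewer verdicts are invented toy data, and the function name `cohens_kappa` is ours; nothing here reflects the paper's actual evaluation.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy verdicts on four annotated nodes (hypothetical data).
reviewer_1 = ["ok", "ok", "bad", "ok"]
reviewer_2 = ["ok", "bad", "bad", "ok"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this toy case the raters agree on 3 of 4 items (75%), chance agreement is 0.5, and kappa is therefore 0.5. The gap between the two numbers is why the referee asks for kappa alongside raw agreement.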

minor comments (3)

[Abstract] The abstract would benefit from a concise statement of the OTQ metric's formulation or key components to better highlight the evaluation contribution.
[Introduction and preliminaries] Notation for tree nodes and hierarchy levels should be defined more explicitly in the early sections to improve readability for readers unfamiliar with tree-structured decomposition.
[Figures] Figure captions could more clearly indicate which elements represent semantic labels versus geometric masks in the example tree visualizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to improve clarity and completeness.

Referee: [Human evaluation section] The manuscript asserts that 'rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment,' yet provides no quantitative details on (a) the number of images or nodes sampled for review, (b) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement), or (c) whether the sampled subset is representative of the full 21K-image distribution and the 3.5K-label long tail. This validation is load-bearing for the claim that COCOTree can serve as reliable ground truth for model training and OTQ benchmarking.

Authors: We agree that additional quantitative details would strengthen the human evaluation section. In the revised manuscript, we will expand the section to report the number of images and nodes sampled for review, inter-annotator agreement statistics (including percentage agreement and Cohen's kappa), and an explanation of how the sampled subset was selected to ensure representativeness across the 21K-image distribution and the long-tail labels. This will provide stronger support for the alignment claim. (revision: yes)

Referee: [Dataset construction pipeline (Section 3)] The description of how LVLM semantic outputs are fused with SAM 3 geometric grounding lacks specifics on error handling, conflict resolution, or failure modes for complex assemblies. Without these details or an error analysis across the 1.8M nodes, it is difficult to evaluate the reliability of the automated annotations at scale.

Authors: We acknowledge that more specifics on the fusion process would improve the pipeline description. We will revise Section 3 to detail error handling and conflict resolution mechanisms between LVLM semantic outputs and SAM 3 geometric grounding, along with discussion of failure modes for complex assemblies. We will also add an error analysis subsection with sampled statistics across the nodes to better demonstrate reliability at scale. (revision: yes)

Circularity Check

0 steps flagged

No circularity in dataset construction or metric proposal

The paper introduces a new task of open tree decomposition and contributes an automated pipeline (LVLMs + SAM 3) to generate the COCOTree dataset, followed by human evaluation to validate alignment with human judgment and the OTQ metric for evaluation. No equations, derivations, fitted parameters, or self-citations are present in the provided text that would reduce any claim to its own inputs by construction. The human evaluation is described as an external confirmation step, not a self-referential loop, making the contribution self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LVLM semantic reasoning combined with SAM 3 produces human-aligned hierarchical annotations at scale.

axioms (1)

domain assumption: Large vision-language models can reliably identify and label visual components in a hierarchical manner without human supervision.
The automated generation pipeline is built on this capability.

pith-pipeline@v0.9.0 · 5737 in / 1169 out tokens · 34091 ms · 2026-05-22T07:52:42.884995+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: We formalize ... open tree decomposition ... COCOTREE ... 21K images and 1.8M structural nodes ... Open Tree Quality (OTQ) metric

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with ... SAM 3

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

[1] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[2] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ... SAM 3: Segment Anything with Concepts. arXiv.

[3] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014.

[4] Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation. arXiv preprint arXiv:2409.10542, 2024.

[5] Daan de Geus, Panagiotis Meletis, Chenyang Lu, Xiaoxiao Wen, and Gijs Dubbelman. Part-aware panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5490, 2021.

[6] Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, and Liang-Chieh Chen. COCONut: Modernizing coco segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21863–21873, 2024.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.

[8] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019.

[9] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. In Computer Vision – ECCV 2022, pages 128–145. Springer, 2022.

[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[11] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.

[12] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[13] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[14] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023.

[15] Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1236–1247, 2022.

[16] Xiangtai Li, Shilin Xu, Jinheng Yang, Guangliang Cheng, Yunhai Tong, and Dacheng Tao. Panoptic-partformer: Learning a unified model for panoptic part segmentation. In Computer Vision – ECCV 2022. Springer, 2022.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer, 2014.

[18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485

[19] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[20] John P. McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond. English WordNet 2020: Improving and extending a WordNet for english using an open-source methodology. In Proceedings of the LREC 2020 Workshop on Multimodal Wordnets, pages 14–19, Marseille, France, 2020. European Language Resources Association.

[21] Panagiotis Meletis, Xiaoxiao Wen, Chenyang Lu, Daan de Geus, and Gijs Dubbelman. Cityscapes-panoptic-parts and PASCAL-panoptic-parts datasets for scene understanding. arXiv preprint arXiv:2004.07944, 2020.

[22] George A. Miller. WordNet: A lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[23] Josh Myers-Dean, Brian Price, Yifei Fan, and Danna Gurari. Hierarchical semantic segmentation with autoregressive language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4120–4130, 2025.

[24] Josh Myers-Dean, Jarek Reynolds, Brian Price, Yifei Fan, and Danna Gurari. Spin: Hierarchical segmentation with subpart granularity in natural images. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 275–292, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72691-0.

[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings...

[26] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. Paco: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023.

[27] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13009–13018, June 2024.

[28] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019.

[29] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[30] Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with coco: A new path forward. In European Conference on Computer Vision (ECCV). Springer, 2024.

[31] Chufeng Tang, Lingxi Xie, Xiaopeng Zhang, Xiaolin Hu, and Qi Tian. Visual recognition by request. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15265–15274, 2023.

[32] Junchi Wang and Lei Ke. Llm-seg: Bridging image segmentation and large language model reasoning, 2024. URL https://arxiv.org/abs/2404.08767

[33] Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. HIPIE: Hierarchical open-vocabulary universal image segmentation. In Advances in Neural Information Processing Systems, 2023.

[34] XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation. arXiv preprint arXiv:2410.18923, 2024.

[35] Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled LVLM-SAM framework for reasoning segmentation in optical remote sensing. arXiv preprint arXiv:2512.19302, 2025.

[36] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019.

[1] [1]

Coco-stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

work page 2018

[2] [2]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Detect what you can: Detecting and representing objects using holistic models and body parts

Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014

work page 1971

[4] [4]

SAM4MLLM: Enhance multi-modal large language model for referring expression segmenta- tion.arXiv preprint arXiv:2409.10542, 2024

Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmenta- tion.arXiv preprint arXiv:2409.10542, 2024

work page arXiv 2024

[5] [5]

Part- aware panoptic segmentation

Daan de Geus, Panagiotis Meletis, Chenyang Lu, Xiaoxiao Wen, and Gijs Dubbelman. Part- aware panoptic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5490, 2021

work page 2021

[6] [6]

COCONut: Modernizing coco segmentation

Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, and Liang-Chieh Chen. COCONut: Modernizing coco segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21863–21873, 2024

work page 2024

[7] [7]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019

work page 2019

[8] [8]

LVIS: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019

work page 2019

[9] [9]

PartImageNet: A large, high-quality dataset of parts

Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. InComputer Vision – ECCV 2022, pages 128–145. Springer, 2022

work page 2022

[10] [10]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. InProceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017

work page 2017

[11] [11]

Panoptic segmentation

Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019

work page 2019

[12] [12]

Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

work page 2023

[13] [13]

LISA: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[14] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023.

[15] Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1236–1247, 2022.

[16] Xiangtai Li, Shilin Xu, Jinheng Yang, Guangliang Cheng, Yunhai Tong, and Dacheng Tao. Panoptic-PartFormer: Learning a unified model for panoptic part segmentation. In Computer Vision – ECCV 2022. Springer, 2022.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer, 2014.

[18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485

[19] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[20] John P. McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond. English WordNet 2020: Improving and extending a WordNet for English using an open-source methodology. In Proceedings of the LREC 2020 Workshop on Multimodal Wordnets, pages 14–19, Marseille, France, 2020. European Language Resources Association.

[21] Panagiotis Meletis, Xiaoxiao Wen, Chenyang Lu, Daan de Geus, and Gijs Dubbelman. Cityscapes-panoptic-parts and PASCAL-panoptic-parts datasets for scene understanding. arXiv preprint arXiv:2004.07944, 2020.

[22] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[23] Josh Myers-Dean, Brian Price, Yifei Fan, and Danna Gurari. Hierarchical semantic segmentation with autoregressive language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4120–4130, 2025.

[24] Josh Myers-Dean, Jarek Reynolds, Brian Price, Yifei Fan, and Danna Gurari. SPIN: Hierarchical segmentation with subpart granularity in natural images. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 275–292, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72691-0.

[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings...

[26] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. PACO: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023.

[27] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13009–13018, June 2024.

[28] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019.

[29] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[30] Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with COCO: A new path forward. In European Conference on Computer Vision (ECCV). Springer, 2024.

[31] Chufeng Tang, Lingxi Xie, Xiaopeng Zhang, Xiaolin Hu, and Qi Tian. Visual recognition by request. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15265–15274, 2023.

[32] Junchi Wang and Lei Ke. LLM-Seg: Bridging image segmentation and large language model reasoning, 2024. URL https://arxiv.org/abs/2404.08767

[33] Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. HIPIE: Hierarchical open-vocabulary universal image segmentation. In Advances in Neural Information Processing Systems, 2023.

[34] XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation. arXiv preprint arXiv:2410.18923, 2024.

[35] Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled LVLM-SAM framework for reasoning segmentation in optical remote sensing. arXiv preprint arXiv:2512.19302, 2025.

[36] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019.

A Prompt Details

This appendix summarizes the prompts used by the LVLM planner in our construction pipeline. The prompts are design...