pith. sign in

arxiv: 2605.22068 · v1 · pith:GCDKBCQRnew · submitted 2026-05-21 · 💻 cs.CV

COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

Pith reviewed 2026-05-22 07:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords open tree decompositionhierarchical image segmentationbenchmark datasetopen vocabularytree-structured visual parsinglarge vision-language modelsstructural consistency metric
0
0 comments X

The pith

A large benchmark enables open tree decomposition of images into flexible hierarchical structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces open tree decomposition as the task of segmenting an image into hierarchical trees of visual components without fixed granularity or closed vocabularies. It overcomes manual annotation costs with an automated pipeline that pairs large vision-language models for semantic reasoning and SAM 3 for geometric grounding. From this pipeline the authors build COCOTree, a dataset of over 21,000 images containing 1.8 million structural nodes across more than 3,500 unique labels that capture long-tail physical assemblies. They also define the Open Tree Quality metric to measure mask precision, label accuracy, and structural consistency together. Human evaluation shows the generated trees align closely with human structural judgments.

Core claim

We formalize the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Leveraging an automated pipeline that combines LVLMs for semantic reasoning and SAM 3 for geometric grounding, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes in an open-vocabulary space of over 3.5K unique labels. We establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency.

What carries the argument

The fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models with the precise geometric grounding of SAM 3 to produce hierarchical tree annotations at scale.

If this is right

  • Models for hierarchical visual parsing can now be trained and compared at scale without predefined category limits.
  • Research on complex physical assemblies can draw on the long-tail label distribution captured in the dataset.
  • The OTQ metric supplies a consistent protocol for measuring both geometric accuracy and structural coherence in tree decompositions.
  • Progress on scene understanding tasks that require flexible part-whole relations becomes directly testable against this benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same automated pipeline approach could be adapted to generate hierarchical annotations for video or 3D scene data.
  • Downstream applications such as robotic manipulation or augmented reality may benefit from using these open trees as intermediate representations.
  • Analysis of the dataset could reveal statistical patterns in natural visual hierarchies that align with human perception across domains.

Load-bearing premise

The automated pipeline using LVLMs for semantic reasoning and SAM 3 for geometric grounding produces annotations that reliably match human structural judgment.

What would settle it

A large independent human study on a random subset of the COCOTree images that reveals substantial mismatches in chosen hierarchies, node boundaries, or component labels compared with the automated annotations.

Figures

Figures reproduced from arXiv: 2605.22068 by Hyosu Kim, Junhyub Lee, Seunghun Chae.

Figure 1
Figure 1. Figure 1: Overview of COCOTREE. Left: COCOTREE provides dense open-tree annotations over COCO images. Middle: each image is decomposed into visible components grounded by instance masks. Right: the same annotation can be viewed as a semantic-node tree, which groups repeated masks under a shared local label, and as an instance-node tree, where each mask becomes a node with an explicit visual parent. inherent to the r… view at source ↗
Figure 2
Figure 2. Figure 2: Fully automated open tree construction pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Review website interface used for human evaluation. Reviewers used this interface to [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Label distribution treemap for COCOTREE [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Joint distribution of masks and labels per image in [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Spatial distribution of annotation centers in the reviewed samples. The plot summarizes [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper formalizes the task of open tree-structured visual decomposition, which involves segmenting images into hierarchical trees of visual components with unconstrained granularity. It introduces an automated pipeline that combines Large Vision-Language Models (LVLMs) for semantic reasoning with SAM 3 for geometric grounding to generate annotations. Leveraging this pipeline, the authors construct the COCOTree dataset containing over 21K images and 1.8M structural nodes across an open vocabulary of 3.5K unique labels. They claim that rigorous human evaluation confirms strong alignment between the generated annotations and human structural judgment. Finally, they propose the Open Tree Quality (OTQ) metric to jointly evaluate mask precision, label accuracy, and structural consistency, and release the dataset along with benchmark code.

Significance. If the central claim of high-quality, human-aligned annotations holds, this work would provide a valuable large-scale foundation benchmark for a new paradigm in visual understanding that emphasizes hierarchical, open-vocabulary tree decompositions. The automated pipeline addresses annotation scalability challenges, and the OTQ metric offers a standardized evaluation protocol that could facilitate future research. The dataset's scale and coverage of long-tail distributions represent practical strengths for training and benchmarking models in complex scene decomposition tasks.

major comments (2)
  1. [Human evaluation section] Human evaluation section: The manuscript asserts that 'rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment,' yet provides no quantitative details on (a) the number of images or nodes sampled for review, (b) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement), or (c) whether the sampled subset is representative of the full 21K-image distribution and the 3.5K-label long tail. This validation is load-bearing for the claim that COCOTree can serve as reliable ground truth for model training and OTQ benchmarking.
  2. [Dataset construction pipeline (Section 3)] Dataset construction pipeline (Section 3): The description of how LVLM semantic outputs are fused with SAM 3 geometric grounding lacks specifics on error handling, conflict resolution, or failure modes for complex assemblies. Without these details or an error analysis across the 1.8M nodes, it is difficult to evaluate the reliability of the automated annotations at scale.
minor comments (3)
  1. [Abstract] The abstract would benefit from a concise statement of the OTQ metric's formulation or key components to better highlight the evaluation contribution.
  2. [Introduction and preliminaries] Notation for tree nodes and hierarchy levels should be defined more explicitly in the early sections to improve readability for readers unfamiliar with tree-structured decomposition.
  3. [Figures] Figure captions could more clearly indicate which elements represent semantic labels versus geometric masks in the example tree visualizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Human evaluation section] Human evaluation section: The manuscript asserts that 'rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment,' yet provides no quantitative details on (a) the number of images or nodes sampled for review, (b) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement), or (c) whether the sampled subset is representative of the full 21K-image distribution and the 3.5K-label long tail. This validation is load-bearing for the claim that COCOTree can serve as reliable ground truth for model training and OTQ benchmarking.

    Authors: We agree that additional quantitative details would strengthen the human evaluation section. In the revised manuscript, we will expand the section to report the number of images and nodes sampled for review, inter-annotator agreement statistics (including percentage agreement and Cohen's kappa), and an explanation of how the sampled subset was selected to ensure representativeness across the 21K-image distribution and the long-tail labels. This will provide stronger support for the alignment claim. revision: yes

  2. Referee: [Dataset construction pipeline (Section 3)] Dataset construction pipeline (Section 3): The description of how LVLM semantic outputs are fused with SAM 3 geometric grounding lacks specifics on error handling, conflict resolution, or failure modes for complex assemblies. Without these details or an error analysis across the 1.8M nodes, it is difficult to evaluate the reliability of the automated annotations at scale.

    Authors: We acknowledge that more specifics on the fusion process would improve the pipeline description. We will revise Section 3 to detail error handling and conflict resolution mechanisms between LVLM semantic outputs and SAM 3 geometric grounding, along with discussion of failure modes for complex assemblies. We will also add an error analysis subsection with sampled statistics across the nodes to better demonstrate reliability at scale. revision: yes

Circularity Check

0 steps flagged

No circularity in dataset construction or metric proposal

full rationale

The paper introduces a new task of open tree decomposition and contributes an automated pipeline (LVLMs + SAM 3) to generate the COCOTree dataset, followed by human evaluation to validate alignment with human judgment and the OTQ metric for evaluation. No equations, derivations, fitted parameters, or self-citations are present in the provided text that would reduce any claim to its own inputs by construction. The human evaluation is described as an external confirmation step, not a self-referential loop, making the contribution self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that LVLM semantic reasoning combined with SAM 3 produces human-aligned hierarchical annotations at scale.

axioms (1)
  • domain assumption Large vision-language models can reliably identify and label visual components in a hierarchical manner without human supervision.
    The automated generation pipeline is built on this capability.

pith-pipeline@v0.9.0 · 5737 in / 1169 out tokens · 34091 ms · 2026-05-22T07:52:42.884995+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

  1. [1]

    Coco-stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  2. [2]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  3. [3]

    Detect what you can: Detecting and representing objects using holistic models and body parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014

  4. [4]

    SAM4MLLM: Enhance multi-modal large language model for referring expression segmenta- tion.arXiv preprint arXiv:2409.10542, 2024

    Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmenta- tion.arXiv preprint arXiv:2409.10542, 2024

  5. [5]

    Part- aware panoptic segmentation

    Daan de Geus, Panagiotis Meletis, Chenyang Lu, Xiaoxiao Wen, and Gijs Dubbelman. Part- aware panoptic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5490, 2021

  6. [6]

    COCONut: Modernizing coco segmentation

    Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, and Liang-Chieh Chen. COCONut: Modernizing coco segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21863–21873, 2024

  7. [7]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019

  8. [8]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019

  9. [9]

    PartImageNet: A large, high-quality dataset of parts

    Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qihang Yu, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. InComputer Vision – ECCV 2022, pages 128–145. Springer, 2022

  10. [10]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. InProceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017

  11. [11]

    Panoptic segmentation

    Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019

  12. [12]

    Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  13. [13]

    LISA: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  14. [14]

    Semantic-sam: Segment and recognize anything at any granu- larity

    Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023. 10

  15. [15]

    Deep hierarchical semantic segmentation

    Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. Deep hierarchical semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1236–1247, 2022

  16. [16]

    Panoptic-partformer: Learning a unified model for panoptic part segmentation

    Xiangtai Li, Shilin Xu, Jinheng Yang, Guangliang Cheng, Yunhai Tong, and Dacheng Tao. Panoptic-partformer: Learning a unified model for panoptic part segmentation. InComputer Vision – ECCV 2022. Springer, 2022

  17. [17]

    Lawrence Zitnick

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision – ECCV 2014, pages 740–755. Springer, 2014

  18. [18]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URLhttps://arxiv.org/abs/2304.08485

  19. [19]

    Fully convolutional networks for seman- tic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for seman- tic segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015

  20. [20]

    McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond

    John P. McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond. English WordNet 2020: Improving and extending a WordNet for english using an open-source methodology. In Proceedings of the LREC 2020 Workshop on Multimodal Wordnets, pages 14–19, Marseille, France, 2020. European Language Resources Association

  21. [21]

    Cityscapes-panoptic-parts and PASCAL-panoptic-parts datasets for scene understanding.arXiv preprint arXiv:2004.07944, 2020

    Panagiotis Meletis, Xiaoxiao Wen, Chenyang Lu, Daan de Geus, and Gijs Dubbelman. Cityscapes-panoptic-parts and PASCAL-panoptic-parts datasets for scene understanding.arXiv preprint arXiv:2004.07944, 2020

  22. [22]

    George A. Miller. WordNet: A lexical database for english.Communications of the ACM, 38 (11):39–41, 1995

  23. [23]

    Hierarchical semantic segmentation with autoregressive language modeling

    Josh Myers-Dean, Brian Price, Yifei Fan, and Danna Gurari. Hierarchical semantic segmentation with autoregressive language modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4120–4130, 2025

  24. [24]

    Spin: Hierarchical segmentation with subpart granularity in natural images

    Josh Myers-Dean, Jarek Reynolds, Brian Price, Yifei Fan, and Danna Gurari. Spin: Hierarchical segmentation with subpart granularity in natural images. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 275–292, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72691-0

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceed- ings of the 38th International Conference on Machine Learning, volume 139 ofProceedings...

  26. [26]

    Paco: Parts and attributes of common objects

    Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. Paco: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023

  27. [27]

    Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13009–13018, June 2024

  28. [28]

    Sentence-BERT: Sentence embeddings using siamese BERT- networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019. 11

  29. [29]

    PixelLM: Pixel reasoning with large multimodal model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  30. [30]

    Benchmarking object detectors with coco: A new path forward

    Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with coco: A new path forward. InEuropean Conference on Computer Vision (ECCV). Springer, 2024

  31. [31]

    Visual recognition by request

    Chufeng Tang, Lingxi Xie, Xiaopeng Zhang, Xiaolin Hu, and Qi Tian. Visual recognition by request. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15265–15274, 2023

  32. [32]

    Llm-seg: Bridging image segmentation and large language model reasoning, 2024

    Junchi Wang and Lei Ke. Llm-seg: Bridging image segmentation and large language model reasoning, 2024. URLhttps://arxiv.org/abs/2404.08767

  33. [33]

    HIPIE: Hierarchical open-vocabulary universal image segmentation

    Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. HIPIE: Hierarchical open-vocabulary universal image segmentation. InAdvances in Neural Information Processing Systems, 2023

  34. [34]

    SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

    XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

  35. [35]

    Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing

    Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled LVLM-SAM framework for reasoning segmentation in optical remote sensing.arXiv preprint arXiv:2512.19302, 2025

  36. [36]

    Semantic understanding of scenes through the ade20k dataset.International Journal of Computer Vision, 127(3):302–321, 2019

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset.International Journal of Computer Vision, 127(3):302–321, 2019. 12 A Prompt Details This appendix summarizes the prompts used by the LVLM planner in our construction pipeline. The prompts are design...