pith. machine review for the scientific record. sign in

arxiv: 2605.08874 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary semantic segmentationhyperbolic geometryPoincaré ball modelCLIP adaptationhierarchical alignmentsemantic alignmentorthogonal transformationfine-tuning framework
0
0 comments X

The pith

HyRo decouples hierarchical and semantic alignment in the Poincaré ball to advance open-vocabulary semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of adapting image-level vision-language models like CLIP to pixel-level open-vocabulary semantic segmentation, where embedding spaces struggle with both hierarchy and precise meaning. Prior hyperbolic methods capture hierarchical structures but leave semantic misalignments unaddressed within the same level. HyRo introduces a fine-tuning approach that separates these concerns by tuning the hyperbolic radius to align levels and applying an orthogonal transformation to refine semantics while keeping that radius fixed. This separation leads to measurable gains on standard benchmarks over earlier techniques.

Core claim

HyRo is a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. It aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius, achieving state-of-the-art performance over prior methods on standard open-vocabulary semantic segmentation benchmarks.

What carries the argument

The radius-preserving orthogonal transformation in the Poincaré ball model that enables angular semantic alignment independently of hierarchical radius adjustments.

If this is right

  • Hierarchical levels align when the hyperbolic radius is adjusted.
  • Semantic relationships within levels improve through angular alignment without radius change.
  • The combination resolves misalignments overlooked by previous hyperbolic approaches.
  • Performance exceeds prior methods on open-vocabulary semantic segmentation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The radius-preserving property could let the same transformation be reused across different embedding scales without retraining the hierarchy.
  • Embedding visualizations after each step might show tighter semantic clusters within levels while keeping level separations intact.
  • The decoupling idea may transfer to other dense prediction tasks that mix coarse taxonomy with fine-grained labels.

Load-bearing premise

The orthogonal transformation refines semantic relationships while preserving the hyperbolic radius, and this specific decoupling fixes the within-level semantic misalignment that prior hyperbolic methods missed.

What would settle it

An ablation experiment that replaces the orthogonal transformation with a radius-altering alternative or removes it entirely, then checks whether within-level semantic consistency and benchmark scores drop to match or fall below prior hyperbolic baselines.

Figures

Figures reproduced from arXiv: 2605.08874 by Dang Huynh, Hai Nguyen-Truong, Hoang M. Truong.

Figure 1
Figure 1. Figure 1: HyRo rotates the text embeddings to achieve a smaller [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of HyRo. Embeddings are rotated around [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall architecture of HyRo. Given image and text inputs, Euclidean embeddings are mapped to the Poincare ball via ´ the exponential map. HyRo then decouples alignment into two stages: (1) Hierarchical Adjustment using block-diagonal radius scaling matrices to align granularity, and (2) Semantic Refinement using orthogonal rotation matrices to adjust angular relationships without altering the radius. The … view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison between HyperCLIP [ [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Attention map visualization for target classes “person”, [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincar\'e ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HyRo, a hyperbolic fine-tuning framework for open-vocabulary semantic segmentation that operates in the Poincaré ball model. It decouples hierarchical alignment (via adjustment of the hyperbolic radius) from within-level semantic alignment (via an origin-centered orthogonal transformation). The authors claim that the orthogonal step refines semantic relationships while theoretically preserving the hyperbolic radius, and that this addresses a limitation in prior hyperbolic methods, yielding state-of-the-art results on standard benchmarks.

Significance. If the empirical results and the claimed decoupling hold under scrutiny, the work provides a clean separation of hierarchy and semantics in hyperbolic embeddings for dense prediction. The explicit invocation of the norm-preservation property of orthogonal matrices (which directly implies preservation of artanh(‖·‖)) is a genuine strength, as it supplies a parameter-free theoretical justification for the semantic-refinement step without disturbing the hierarchical structure.

major comments (2)
  1. [§3 (Method)] §3 (Method): The central claim that radius adjustment and the orthogonal transformation are decoupled and that the latter specifically resolves within-level semantic misalignment rests on the construction, but the manuscript supplies no explicit loss terms, optimization schedule, or small-scale derivation showing that angular alignment improves semantic metrics independently of radius changes. Without this, it is unclear whether the observed gains are attributable to the proposed decoupling rather than to generic fine-tuning.
  2. [§4 (Experiments)] §4 (Experiments): The SOTA claim is load-bearing for the contribution, yet the manuscript provides no tables with per-dataset mIoU numbers, standard deviations across runs, or ablations that isolate the radius-adjustment component from the orthogonal component. This absence prevents verification that the decoupling, rather than other implementation choices, drives the reported improvements over prior hyperbolic baselines.
minor comments (2)
  1. [Abstract] Abstract: The abstract states the SOTA result but omits the concrete benchmarks (e.g., ADE20K, Pascal-Context) and the magnitude of improvement; adding one quantitative sentence would strengthen the summary.
  2. [Notation] Notation: The hyperbolic radius is referred to interchangeably as “radius” and “hyperbolic radius”; a single consistent symbol (e.g., r or ρ) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the theoretical strength of the norm-preservation argument. We respond to each major comment below and indicate the revisions we will undertake.

read point-by-point responses
  1. Referee: [§3 (Method)] §3 (Method): The central claim that radius adjustment and the orthogonal transformation are decoupled and that the latter specifically resolves within-level semantic misalignment rests on the construction, but the manuscript supplies no explicit loss terms, optimization schedule, or small-scale derivation showing that angular alignment improves semantic metrics independently of radius changes. Without this, it is unclear whether the observed gains are attributable to the proposed decoupling rather than to generic fine-tuning.

    Authors: Section 3 presents the procedure as two sequential operations: radius scaling first aligns hierarchical levels by moving embeddings along radial lines in the Poincaré ball, after which an origin-centered orthogonal matrix is applied to rotate embeddings while exactly preserving their Euclidean norms and therefore their hyperbolic radii (via the identity artanh(‖Qx‖) = artanh(‖x‖) for orthogonal Q). The fine-tuning objective is a standard contrastive loss applied to the final embeddings. We agree that an explicit derivation isolating the angular-alignment effect would strengthen the decoupling claim. In the revision we will insert the precise loss formulation, the optimization schedule, and a short analytical example demonstrating that the orthogonal step changes only angular distances without altering radii. revision: partial

  2. Referee: [§4 (Experiments)] §4 (Experiments): The SOTA claim is load-bearing for the contribution, yet the manuscript provides no tables with per-dataset mIoU numbers, standard deviations across runs, or ablations that isolate the radius-adjustment component from the orthogonal component. This absence prevents verification that the decoupling, rather than other implementation choices, drives the reported improvements over prior hyperbolic baselines.

    Authors: We accept that the current experimental presentation is insufficient for independent verification. The manuscript reports aggregate benchmark scores and comparisons against prior methods, but does not include the requested per-dataset breakdowns, run-wise standard deviations, or component-wise ablations. In the revised version we will add comprehensive tables listing mIoU for each dataset, standard deviations over multiple random seeds, and ablation experiments that apply radius adjustment alone, orthogonal transformation alone, and both together, thereby isolating the contribution of the proposed decoupling. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces HyRo as an independent fine-tuning framework that decouples radius adjustment (for cross-level hierarchy) from origin-centered orthogonal transformation (for within-level angular semantics) inside the Poincaré ball. The stated property that orthogonal matrices preserve the hyperbolic radius follows directly from the standard Euclidean norm invariance ||Ox|| = ||x|| and the definition of hyperbolic radius via artanh, which is a pre-existing geometric fact rather than a result derived from the paper's own fitted parameters, self-citations, or input data. No equations, predictions, or uniqueness claims reduce by construction to the method's own outputs or prior self-referential work; the central proposal remains a self-contained architectural suggestion whose performance is assessed externally on benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or detailed axioms are provided beyond the high-level use of hyperbolic geometry.

axioms (1)
  • domain assumption Hyperbolic geometry can model hierarchical relationships in embedding spaces.
    Abstract states that recent works leverage hyperbolic geometry to model hierarchical relationships.

pith-pipeline@v0.9.0 · 5426 in / 1260 out tokens · 48262 ms · 2026-05-12T01:12:50.818591+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean dAlembert_cosh_solution_aczel echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Since the Poincaré ball model is conformal, angles measured at the origin coincide with their Euclidean counterparts... ∥x′∥ = ∥Rx∥ = ∥x∥ (due to orthogonality), the hyperbolic radius Rad x′ = Rad x remains unchanged.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. [1]

    Biederman

    I. Biederman. Recognition-by-components: A theory of hu- man image understanding.Psychological Review, 94:115– 147, 1987. 3

  2. [2]

    Zero-shot semantic segmentation.Advances in Neural Information Processing Systems (NeurIPS), 32, 2019

    Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick P´erez. Zero-shot semantic segmentation.Advances in Neural Information Processing Systems (NeurIPS), 32, 2019. 6

  3. [3]

    Coco- stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 6

  4. [4]

    A. Cayley. Sur quelques propri ´et´es des d ´eterminants gauches.Journal f ¨ur die reine und angewandte Mathematik, 1846(32):119–123, 1846. 4

  5. [5]

    Cat-seg: Cost aggregation for open-vocabulary semantic segmenta- tion

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmenta- tion. InIEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 4113–4123, 2024. 1, 2, 5, 6

  6. [6]

    Hyperbolic Image- Text Representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Ramakrishna Vedantam. Hyperbolic Image- Text Representations. InInternational Conference on Ma- chine Learning (ICML), 2023. 1, 2, 3

  7. [7]

    Embedding text in hyperbolic spaces

    Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. Embedding text in hyperbolic spaces. InProceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69, 2018. 2

  8. [8]

    MeViS: A large-scale benchmark for video segmentation with motion expressions

    Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. MeViS: A large-scale benchmark for video segmentation with motion expressions. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023. 8

  9. [9]

    MOSE: A new dataset for video object segmentation in complex scenes

    Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023. 8

  10. [10]

    MeViS: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. MeViS: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 8

  11. [11]

    Mosev2: A more challenging dataset for video object segmentation in complex scenes.arXiv preprint arXiv:2508.05630, 2025

    Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip HS Torr, and Song Bai. MOSEv2: A more challenging dataset for video object segmentation in complex scenes.arXiv preprint arXiv:2508.05630, 2025. 8

  12. [12]

    De- coupling zero-shot semantic segmentation

    Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11583–11592, 2022. 2, 6

  13. [13]

    Hyperbolic vision transformers: Combining improvements in metric learning

    Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan Oseledets. Hyperbolic vision transformers: Combining improvements in metric learning. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7409–7419, 2022. 3

  14. [14]

    The pascal visual object classes (voc) challenge.International Journal of Computer Vision, 88:303–308, 2009

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International Journal of Computer Vision, 88:303–308, 2009. 6

  15. [15]

    Hyperbolic neural networks

    Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic neural networks. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), 2018. 3, 4

  16. [16]

    Scaling open-vocabulary image segmentation with image- level labels

    Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image- level labels. InEuropean Conference on Computer Vision (ECCV), pages 540–557, 2022. 6

  17. [17]

    Scaling open-vocabulary image segmentation with image- level labels

    Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image- level labels. InEuropean Conference on Computer Vision (ECCV), pages 540–557, 2022. 2, 6

  18. [18]

    Hyperbolic groups

    Mikhael Gromov. Hyperbolic groups. InEssays in group theory, pages 75–263, 1987. 2

  19. [19]

    Hyperbolic im- age embeddings

    Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Usti- nova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic im- age embeddings. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2020. 3

  20. [20]

    Inferring concept hierarchies from text corpora via hyperbolic embeddings

    Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. InProceed- ings of the 57th Annual Meeting of the Association for Com- putational Linguistics (ACL), pages 3231–3241, 2019. 2

  21. [21]

    Language-driven semantic seg- mentation

    Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic seg- mentation. InInternational Conference on Learning Rep- resentations (ICLR), 2022. 6

  22. [23]

    Open-vocabulary semantic segmentation with mask-adapted clip

    Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7061–7070,

  23. [24]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021. 5

  24. [25]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Machine Learning (ICML), 2019. 6

  25. [26]

    On embedding trees into uniformly convex banach spaces.Israel Journal of Mathematics, 114(1):221– 237, 1999

    Ji ˇr´ı Matouˇsek. On embedding trees into uniformly convex banach spaces.Israel Journal of Mathematics, 114(1):221– 237, 1999. 2

  26. [27]

    The role of context for object detection and se- mantic segmentation in the wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and se- mantic segmentation in the wild. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2014. 6

  27. [28]

    Poincar ´e embeddings for learning hierarchical representations

    Maximillian Nickel and Douwe Kiela. Poincar ´e embeddings for learning hierarchical representations. InAdvances in Neural Information Processing Systems (NeurIPS), 2017. 2, 3

  28. [29]

    Compositional entailment learning for hyperbolic vision-language models

    Avik Pal, Max van Spengler, Guido Maria D’Amely di Me- lendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InInternational Conference on Learning Representations (ICLR), 2025. 1, 2, 3

  29. [30]

    Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation

    Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaom- ing Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15009–15020, 2025. 2, 4

  30. [31]

    Under- standing fine-tuning clip for open-vocabulary semantic seg- mentation in hyperbolic space

    Zelin Peng, Zhengqin Xu, Zhilin Zeng, Changsong Wen, Yu Huang, Menglin Yang, Feilong Tang, and Wei Shen. Under- standing fine-tuning clip for open-vocabulary semantic seg- mentation in hyperbolic space. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4562–4572, 2025. 2, 3, 5, 6, 7

  31. [32]

    Learning transferable vi- sual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable vi- sual models from natural language supervision. InInter- national Conference on Machine Learning (ICML), pages 8748–8763, 2021. 1, 2, 5, 6

  32. [33]

    Representation tradeoffs for hyperbolic embeddings

    Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. InIn- ternational Conference on Machine Learning (ICML), pages 4460–4469, 2018. 2

  33. [34]

    Low distortion delaunay embedding of trees in hyperbolic plane

    Rik Sarkar. Low distortion delaunay embedding of trees in hyperbolic plane. InGraph Drawing, pages 355–366, 2012. 2

  34. [35]

    Poincar´e glove: Hyperbolic word embeddings

    Alexandru Tifrea, Gary B ´ecigneul, and Octavian-Eugen Ganea. Poincar´e glove: Hyperbolic word embeddings. InIn- ternational Conference on Machine Learning (ICML), 2019. 2

  35. [36]

    Learning visual hierarchies in hyperbolic space for image re- trieval

    Ziwei Wang, Sameera Ramasinghe, Chenchen Xu, Julien Monteil, Loris Bazzani, and Thalaiyasingam Ajanthan. Learning visual hierarchies in hyperbolic space for image re- trieval. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 9924–9934, 2025. 2

  36. [37]

    Sed: A simple encoder-decoder for open- vocabulary semantic segmentation

    Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open- vocabulary semantic segmentation. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 3426–3436, 2024. 1, 2, 5, 6

  37. [38]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xi- aolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2955–2966, 2023. 1

  38. [39]

    A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model

    Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model. InEuropean Conference on Computer Vi- sion (ECCV), pages 736–753, 2022. 6

  39. [40]

    A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model

    Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model. InEuropean Conference on Computer Vi- sion (ECCV), pages 736–753, 2022. 1, 2

  40. [41]

    Side adapter network for open-vocabulary semantic segmentation

    Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2945–2954, 2023. 1, 2, 5, 6

  41. [42]

    Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang- Chieh Chen. Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 2

  42. [43]

    Dpseg: Dual-prompt cost volume learning for open-vocabulary semantic segmentation

    Ziyu Zhao, Xiaoguang Li, Lingjia Shi, Nasrin Imanpour, and Song Wang. Dpseg: Dual-prompt cost volume learning for open-vocabulary semantic segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25346–25356, 2025. 1, 2

  43. [44]

    Semantic under- standing of scenes through the ade20k dataset.International Journal of Computer Vision, 127:302–321, 2019

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fi- dler, Adela Barriuso, and Antonio Torralba. Semantic under- standing of scenes through the ade20k dataset.International Journal of Computer Vision, 127:302–321, 2019. 6, 7

  44. [45]

    Extract free dense labels from clip

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean Conference on Com- puter Vision (ECCV), 2022. 1, 2

  45. [46]

    Zegclip: Towards adapting clip for zero-shot seman- tic segmentation

    Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yi- fan Liu. Zegclip: Towards adapting clip for zero-shot seman- tic segmentation. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 11175–11185,