arxiv: 2605.08874 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation

Hoang M. Truong , Hai Nguyen-Truong , Dang Huynh

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary semantic segmentationhyperbolic geometryPoincaré ball modelCLIP adaptationhierarchical alignmentsemantic alignmentorthogonal transformationfine-tuning framework

0 comments

The pith

HyRo decouples hierarchical and semantic alignment in the Poincaré ball to advance open-vocabulary semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of adapting image-level vision-language models like CLIP to pixel-level open-vocabulary semantic segmentation, where embedding spaces struggle with both hierarchy and precise meaning. Prior hyperbolic methods capture hierarchical structures but leave semantic misalignments unaddressed within the same level. HyRo introduces a fine-tuning approach that separates these concerns by tuning the hyperbolic radius to align levels and applying an orthogonal transformation to refine semantics while keeping that radius fixed. This separation leads to measurable gains on standard benchmarks over earlier techniques.

Core claim

HyRo is a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. It aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius, achieving state-of-the-art performance over prior methods on standard open-vocabulary semantic segmentation benchmarks.

What carries the argument

The radius-preserving orthogonal transformation in the Poincaré ball model that enables angular semantic alignment independently of hierarchical radius adjustments.

If this is right

Hierarchical levels align when the hyperbolic radius is adjusted.
Semantic relationships within levels improve through angular alignment without radius change.
The combination resolves misalignments overlooked by previous hyperbolic approaches.
Performance exceeds prior methods on open-vocabulary semantic segmentation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The radius-preserving property could let the same transformation be reused across different embedding scales without retraining the hierarchy.
Embedding visualizations after each step might show tighter semantic clusters within levels while keeping level separations intact.
The decoupling idea may transfer to other dense prediction tasks that mix coarse taxonomy with fine-grained labels.

Load-bearing premise

The orthogonal transformation refines semantic relationships while preserving the hyperbolic radius, and this specific decoupling fixes the within-level semantic misalignment that prior hyperbolic methods missed.

What would settle it

An ablation experiment that replaces the orthogonal transformation with a radius-altering alternative or removes it entirely, then checks whether within-level semantic consistency and benchmark scores drop to match or fall below prior hyperbolic baselines.

Figures

Figures reproduced from arXiv: 2605.08874 by Dang Huynh, Hai Nguyen-Truong, Hoang M. Truong.

**Figure 2.** Figure 2: Overview of HyRo. Embeddings are rotated around [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overall architecture of HyRo. Given image and text inputs, Euclidean embeddings are mapped to the Poincare ball via ´ the exponential map. HyRo then decouples alignment into two stages: (1) Hierarchical Adjustment using block-diagonal radius scaling matrices to align granularity, and (2) Semantic Refinement using orthogonal rotation matrices to adjust angular relationships without altering the radius. The … view at source ↗

**Figure 4.** Figure 4: Qualitative comparison between HyperCLIP [ [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Attention map visualization for target classes “person”, [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincar\'e ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HyRo cleanly decouples radius-based hierarchy alignment from within-level semantic refinement via a radius-preserving orthogonal transform in the Poincaré ball, but the SOTA claim sits on unshown experiments.

read the letter

The main thing here is that HyRo splits the alignment problem: it tweaks the hyperbolic radius to handle cross-level hierarchy and then applies an origin-centered orthogonal matrix to fix semantic relationships inside the same level. The stress-test note is right that any orthogonal matrix keeps the Euclidean norm, so the radius (and thus the hyperbolic distance from origin) stays intact. That part of the construction has no internal contradiction and directly targets the gap the abstract flags in prior hyperbolic work on vision-language models for segmentation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HyRo, a hyperbolic fine-tuning framework for open-vocabulary semantic segmentation that operates in the Poincaré ball model. It decouples hierarchical alignment (via adjustment of the hyperbolic radius) from within-level semantic alignment (via an origin-centered orthogonal transformation). The authors claim that the orthogonal step refines semantic relationships while theoretically preserving the hyperbolic radius, and that this addresses a limitation in prior hyperbolic methods, yielding state-of-the-art results on standard benchmarks.

Significance. If the empirical results and the claimed decoupling hold under scrutiny, the work provides a clean separation of hierarchy and semantics in hyperbolic embeddings for dense prediction. The explicit invocation of the norm-preservation property of orthogonal matrices (which directly implies preservation of artanh(‖·‖)) is a genuine strength, as it supplies a parameter-free theoretical justification for the semantic-refinement step without disturbing the hierarchical structure.

major comments (2)

[§3 (Method)] §3 (Method): The central claim that radius adjustment and the orthogonal transformation are decoupled and that the latter specifically resolves within-level semantic misalignment rests on the construction, but the manuscript supplies no explicit loss terms, optimization schedule, or small-scale derivation showing that angular alignment improves semantic metrics independently of radius changes. Without this, it is unclear whether the observed gains are attributable to the proposed decoupling rather than to generic fine-tuning.
[§4 (Experiments)] §4 (Experiments): The SOTA claim is load-bearing for the contribution, yet the manuscript provides no tables with per-dataset mIoU numbers, standard deviations across runs, or ablations that isolate the radius-adjustment component from the orthogonal component. This absence prevents verification that the decoupling, rather than other implementation choices, drives the reported improvements over prior hyperbolic baselines.

minor comments (2)

[Abstract] Abstract: The abstract states the SOTA result but omits the concrete benchmarks (e.g., ADE20K, Pascal-Context) and the magnitude of improvement; adding one quantitative sentence would strengthen the summary.
[Notation] Notation: The hyperbolic radius is referred to interchangeably as “radius” and “hyperbolic radius”; a single consistent symbol (e.g., r or ρ) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the theoretical strength of the norm-preservation argument. We respond to each major comment below and indicate the revisions we will undertake.

read point-by-point responses

Referee: [§3 (Method)] §3 (Method): The central claim that radius adjustment and the orthogonal transformation are decoupled and that the latter specifically resolves within-level semantic misalignment rests on the construction, but the manuscript supplies no explicit loss terms, optimization schedule, or small-scale derivation showing that angular alignment improves semantic metrics independently of radius changes. Without this, it is unclear whether the observed gains are attributable to the proposed decoupling rather than to generic fine-tuning.

Authors: Section 3 presents the procedure as two sequential operations: radius scaling first aligns hierarchical levels by moving embeddings along radial lines in the Poincaré ball, after which an origin-centered orthogonal matrix is applied to rotate embeddings while exactly preserving their Euclidean norms and therefore their hyperbolic radii (via the identity artanh(‖Qx‖) = artanh(‖x‖) for orthogonal Q). The fine-tuning objective is a standard contrastive loss applied to the final embeddings. We agree that an explicit derivation isolating the angular-alignment effect would strengthen the decoupling claim. In the revision we will insert the precise loss formulation, the optimization schedule, and a short analytical example demonstrating that the orthogonal step changes only angular distances without altering radii. revision: partial
Referee: [§4 (Experiments)] §4 (Experiments): The SOTA claim is load-bearing for the contribution, yet the manuscript provides no tables with per-dataset mIoU numbers, standard deviations across runs, or ablations that isolate the radius-adjustment component from the orthogonal component. This absence prevents verification that the decoupling, rather than other implementation choices, drives the reported improvements over prior hyperbolic baselines.

Authors: We accept that the current experimental presentation is insufficient for independent verification. The manuscript reports aggregate benchmark scores and comparisons against prior methods, but does not include the requested per-dataset breakdowns, run-wise standard deviations, or component-wise ablations. In the revised version we will add comprehensive tables listing mIoU for each dataset, standard deviations over multiple random seeds, and ablation experiments that apply radius adjustment alone, orthogonal transformation alone, and both together, thereby isolating the contribution of the proposed decoupling. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces HyRo as an independent fine-tuning framework that decouples radius adjustment (for cross-level hierarchy) from origin-centered orthogonal transformation (for within-level angular semantics) inside the Poincaré ball. The stated property that orthogonal matrices preserve the hyperbolic radius follows directly from the standard Euclidean norm invariance ||Ox|| = ||x|| and the definition of hyperbolic radius via artanh, which is a pre-existing geometric fact rather than a result derived from the paper's own fitted parameters, self-citations, or input data. No equations, predictions, or uniqueness claims reduce by construction to the method's own outputs or prior self-referential work; the central proposal remains a self-contained architectural suggestion whose performance is assessed externally on benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or detailed axioms are provided beyond the high-level use of hyperbolic geometry.

axioms (1)

domain assumption Hyperbolic geometry can model hierarchical relationships in embedding spaces.
Abstract states that recent works leverage hyperbolic geometry to model hierarchical relationships.

pith-pipeline@v0.9.0 · 5426 in / 1260 out tokens · 48262 ms · 2026-05-12T01:12:50.818591+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean dAlembert_cosh_solution_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Since the Poincaré ball model is conformal, angles measured at the origin coincide with their Euclidean counterparts... ∥x′∥ = ∥Rx∥ = ∥x∥ (due to orthogonality), the hyperbolic radius Rad x′ = Rad x remains unchanged.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1]

Biederman

I. Biederman. Recognition-by-components: A theory of hu- man image understanding.Psychological Review, 94:115– 147, 1987. 3

work page 1987
[2]

Zero-shot semantic segmentation.Advances in Neural Information Processing Systems (NeurIPS), 32, 2019

Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick P´erez. Zero-shot semantic segmentation.Advances in Neural Information Processing Systems (NeurIPS), 32, 2019. 6

work page 2019
[3]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 6

work page 2018
[4]

A. Cayley. Sur quelques propri ´et´es des d ´eterminants gauches.Journal f ¨ur die reine und angewandte Mathematik, 1846(32):119–123, 1846. 4

work page
[5]

Cat-seg: Cost aggregation for open-vocabulary semantic segmenta- tion

Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmenta- tion. InIEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 4113–4123, 2024. 1, 2, 5, 6

work page 2024
[6]

Hyperbolic Image- Text Representations

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Ramakrishna Vedantam. Hyperbolic Image- Text Representations. InInternational Conference on Ma- chine Learning (ICML), 2023. 1, 2, 3

work page 2023
[7]

Embedding text in hyperbolic spaces

Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. Embedding text in hyperbolic spaces. InProceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69, 2018. 2

work page 2018
[8]

MeViS: A large-scale benchmark for video segmentation with motion expressions

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. MeViS: A large-scale benchmark for video segmentation with motion expressions. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023. 8

work page 2023
[9]

MOSE: A new dataset for video object segmentation in complex scenes

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023. 8

work page 2023
[10]

MeViS: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. MeViS: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 8

work page 2025
[11]

Mosev2: A more challenging dataset for video object segmentation in complex scenes.arXiv preprint arXiv:2508.05630, 2025

Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip HS Torr, and Song Bai. MOSEv2: A more challenging dataset for video object segmentation in complex scenes.arXiv preprint arXiv:2508.05630, 2025. 8

work page arXiv 2025
[12]

De- coupling zero-shot semantic segmentation

Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11583–11592, 2022. 2, 6

work page 2022
[13]

Hyperbolic vision transformers: Combining improvements in metric learning

Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan Oseledets. Hyperbolic vision transformers: Combining improvements in metric learning. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7409–7419, 2022. 3

work page 2022
[14]

The pascal visual object classes (voc) challenge.International Journal of Computer Vision, 88:303–308, 2009

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International Journal of Computer Vision, 88:303–308, 2009. 6

work page 2009
[15]

Hyperbolic neural networks

Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic neural networks. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), 2018. 3, 4

work page 2018
[16]

Scaling open-vocabulary image segmentation with image- level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image- level labels. InEuropean Conference on Computer Vision (ECCV), pages 540–557, 2022. 6

work page 2022
[17]

Scaling open-vocabulary image segmentation with image- level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image- level labels. InEuropean Conference on Computer Vision (ECCV), pages 540–557, 2022. 2, 6

work page 2022
[18]

Hyperbolic groups

Mikhael Gromov. Hyperbolic groups. InEssays in group theory, pages 75–263, 1987. 2

work page 1987
[19]

Hyperbolic im- age embeddings

Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Usti- nova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic im- age embeddings. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2020. 3

work page 2020
[20]

Inferring concept hierarchies from text corpora via hyperbolic embeddings

Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. InProceed- ings of the 57th Annual Meeting of the Association for Com- putational Linguistics (ACL), pages 3231–3241, 2019. 2

work page 2019
[21]

Language-driven semantic seg- mentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic seg- mentation. InInternational Conference on Learning Rep- resentations (ICLR), 2022. 6

work page 2022
[23]

Open-vocabulary semantic segmentation with mask-adapted clip

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7061–7070,

work page
[24]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021. 5

work page 2021
[25]

Decoupled weight de- cay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Machine Learning (ICML), 2019. 6

work page 2019
[26]

On embedding trees into uniformly convex banach spaces.Israel Journal of Mathematics, 114(1):221– 237, 1999

Ji ˇr´ı Matouˇsek. On embedding trees into uniformly convex banach spaces.Israel Journal of Mathematics, 114(1):221– 237, 1999. 2

work page 1999
[27]

The role of context for object detection and se- mantic segmentation in the wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and se- mantic segmentation in the wild. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2014. 6

work page 2014
[28]

Poincar ´e embeddings for learning hierarchical representations

Maximillian Nickel and Douwe Kiela. Poincar ´e embeddings for learning hierarchical representations. InAdvances in Neural Information Processing Systems (NeurIPS), 2017. 2, 3

work page 2017
[29]

Compositional entailment learning for hyperbolic vision-language models

Avik Pal, Max van Spengler, Guido Maria D’Amely di Me- lendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InInternational Conference on Learning Representations (ICLR), 2025. 1, 2, 3

work page 2025
[30]

Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation

Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaom- ing Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15009–15020, 2025. 2, 4

work page 2025
[31]

Under- standing fine-tuning clip for open-vocabulary semantic seg- mentation in hyperbolic space

Zelin Peng, Zhengqin Xu, Zhilin Zeng, Changsong Wen, Yu Huang, Menglin Yang, Feilong Tang, and Wei Shen. Under- standing fine-tuning clip for open-vocabulary semantic seg- mentation in hyperbolic space. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4562–4572, 2025. 2, 3, 5, 6, 7

work page 2025
[32]

Learning transferable vi- sual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable vi- sual models from natural language supervision. InInter- national Conference on Machine Learning (ICML), pages 8748–8763, 2021. 1, 2, 5, 6

work page 2021
[33]

Representation tradeoffs for hyperbolic embeddings

Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. InIn- ternational Conference on Machine Learning (ICML), pages 4460–4469, 2018. 2

work page 2018
[34]

Low distortion delaunay embedding of trees in hyperbolic plane

Rik Sarkar. Low distortion delaunay embedding of trees in hyperbolic plane. InGraph Drawing, pages 355–366, 2012. 2

work page 2012
[35]

Poincar´e glove: Hyperbolic word embeddings

Alexandru Tifrea, Gary B ´ecigneul, and Octavian-Eugen Ganea. Poincar´e glove: Hyperbolic word embeddings. InIn- ternational Conference on Machine Learning (ICML), 2019. 2

work page 2019
[36]

Learning visual hierarchies in hyperbolic space for image re- trieval

Ziwei Wang, Sameera Ramasinghe, Chenchen Xu, Julien Monteil, Loris Bazzani, and Thalaiyasingam Ajanthan. Learning visual hierarchies in hyperbolic space for image re- trieval. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 9924–9934, 2025. 2

work page 2025
[37]

Sed: A simple encoder-decoder for open- vocabulary semantic segmentation

Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open- vocabulary semantic segmentation. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 3426–3436, 2024. 1, 2, 5, 6

work page 2024
[38]

Open-vocabulary panoptic segmentation with text-to-image diffusion models

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xi- aolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2955–2966, 2023. 1

work page 2023
[39]

A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model

Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model. InEuropean Conference on Computer Vi- sion (ECCV), pages 736–753, 2022. 6

work page 2022
[40]

A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model

Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open- vocabulary semantic segmentation with pre-trained vision- language model. InEuropean Conference on Computer Vi- sion (ECCV), pages 736–753, 2022. 1, 2

work page 2022
[41]

Side adapter network for open-vocabulary semantic segmentation

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2945–2954, 2023. 1, 2, 5, 6

work page 2023
[42]

Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip

Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang- Chieh Chen. Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 2

work page 2023
[43]

Dpseg: Dual-prompt cost volume learning for open-vocabulary semantic segmentation

Ziyu Zhao, Xiaoguang Li, Lingjia Shi, Nasrin Imanpour, and Song Wang. Dpseg: Dual-prompt cost volume learning for open-vocabulary semantic segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25346–25356, 2025. 1, 2

work page 2025
[44]

Semantic under- standing of scenes through the ade20k dataset.International Journal of Computer Vision, 127:302–321, 2019

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fi- dler, Adela Barriuso, and Antonio Torralba. Semantic under- standing of scenes through the ade20k dataset.International Journal of Computer Vision, 127:302–321, 2019. 6, 7

work page 2019
[45]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean Conference on Com- puter Vision (ECCV), 2022. 1, 2

work page 2022
[46]

Zegclip: Towards adapting clip for zero-shot seman- tic segmentation

Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yi- fan Liu. Zegclip: Towards adapting clip for zero-shot seman- tic segmentation. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 11175–11185,

work page