Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:12 UTC · model grok-4.3
The pith
HyRo decouples hierarchical and semantic alignment in the Poincaré ball to advance open-vocabulary semantic segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HyRo is a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. It aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius, achieving state-of-the-art performance over prior methods on standard open-vocabulary semantic segmentation benchmarks.
What carries the argument
The radius-preserving orthogonal transformation in the Poincaré ball model that enables angular semantic alignment independently of hierarchical radius adjustments.
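The radius-preservation property can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes unit curvature and takes the hyperbolic radius to be the distance from the origin, 2·artanh(‖x‖) (the constant factor depends on convention and does not affect the conclusion). The helper names `hyperbolic_radius` and `givens_rotate` are illustrative.

```python
import math

def hyperbolic_radius(x):
    """Distance from the origin in the unit-curvature Poincaré ball (assumed convention)."""
    norm = math.sqrt(sum(v * v for v in x))
    return 2.0 * math.atanh(norm)

def givens_rotate(x, i, j, theta):
    """Rotate x by theta in the (i, j) coordinate plane; this is an orthogonal map."""
    y = list(x)
    c, s = math.cos(theta), math.sin(theta)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y

# A point with Euclidean norm below 1, so it lies inside the ball.
x = [0.3, -0.2, 0.5, 0.1]

# Compose two plane rotations; a product of orthogonal maps is orthogonal.
x_rot = givens_rotate(givens_rotate(x, 0, 2, 0.7), 1, 3, -1.1)

radius_before = hyperbolic_radius(x)
radius_after = hyperbolic_radius(x_rot)
print(radius_before, radius_after)
```

The two printed values should agree up to floating-point rounding, while the coordinates of `x_rot` differ from those of `x`.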
If this is right
- Hierarchical levels align when the hyperbolic radius is adjusted.
- Semantic relationships within levels improve through angular alignment without radius change.
- The combination resolves misalignments overlooked by previous hyperbolic approaches.
- Performance exceeds prior methods on open-vocabulary semantic segmentation benchmarks.
Where Pith is reading between the lines
- The radius-preserving property could let the same transformation be reused across different embedding scales without retraining the hierarchy.
- Embedding visualizations after each step might show tighter semantic clusters within levels while keeping level separations intact.
- The decoupling idea may transfer to other dense prediction tasks that mix coarse taxonomy with fine-grained labels.
Load-bearing premise
The orthogonal transformation refines semantic relationships while preserving the hyperbolic radius, and this specific decoupling fixes the within-level semantic misalignment that prior hyperbolic methods missed.
What would settle it
An ablation experiment that replaces the orthogonal transformation with a radius-altering alternative or removes it entirely, then checks whether within-level semantic consistency and benchmark scores drop to match or fall below prior hyperbolic baselines.
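The geometric side of such an ablation can be illustrated on toy data. The sketch below is an assumption-laden example, not an experiment from the paper: it compares an orthogonal map (a 2D rotation) with a radius-altering alternative (a shear, chosen here purely for illustration) and measures how far each moves the hyperbolic radius, again taken as 2·artanh(‖x‖) under unit curvature.

```python
import math

def hyperbolic_radius(p):
    """Distance from the origin in the unit-curvature Poincaré disk (assumed convention)."""
    return 2.0 * math.atanh(math.hypot(p[0], p[1]))

def apply(matrix, p):
    """Apply a 2x2 matrix, given as two row tuples, to a 2D point."""
    (a, b), (c, d) = matrix
    return (a * p[0] + b * p[1], c * p[0] + d * p[1])

def max_radius_drift(matrix, points):
    """Largest absolute change in hyperbolic radius caused by the map."""
    return max(
        abs(hyperbolic_radius(apply(matrix, p)) - hyperbolic_radius(p))
        for p in points
    )

theta = 0.9
rotation = ((math.cos(theta), -math.sin(theta)),
            (math.sin(theta), math.cos(theta)))   # orthogonal
shear = ((1.0, 0.5), (0.0, 1.0))                  # not orthogonal

# Toy points chosen so that both maps keep them inside the unit disk.
points = [(0.3, 0.4), (-0.2, 0.1), (0.0, 0.6)]

drift_rotation = max_radius_drift(rotation, points)
drift_shear = max_radius_drift(shear, points)
print(drift_rotation, drift_shear)
```

The rotation leaves every radius unchanged up to rounding, whereas the shear shifts them noticeably. This only shows that a non-orthogonal map disturbs the hierarchy coordinate; whether that translates into lower benchmark scores is the empirical question the ablation would have to answer.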
Original abstract
Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HyRo, a hyperbolic fine-tuning framework for open-vocabulary semantic segmentation that operates in the Poincaré ball model. It decouples hierarchical alignment (via adjustment of the hyperbolic radius) from within-level semantic alignment (via an origin-centered orthogonal transformation). The authors claim that the orthogonal step refines semantic relationships while theoretically preserving the hyperbolic radius, and that this addresses a limitation in prior hyperbolic methods, yielding state-of-the-art results on standard benchmarks.
Significance. If the empirical results and the claimed decoupling hold under scrutiny, the work provides a clean separation of hierarchy and semantics in hyperbolic embeddings for dense prediction. The explicit invocation of the norm-preservation property of orthogonal matrices (which directly implies preservation of artanh(‖·‖)) is a genuine strength, as it supplies a parameter-free theoretical justification for the semantic-refinement step without disturbing the hierarchical structure.
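The norm-preservation step referred to here is short enough to write out. As a sketch, assume Q is a real orthogonal matrix (so that Q^T Q = I) and x is a point of the ball with ‖x‖ < 1; the symbol Q is used only for illustration.

```latex
\begin{align*}
\|Qx\|^2 &= (Qx)^\top (Qx) = x^\top Q^\top Q\, x = x^\top x = \|x\|^2, \\
\operatorname{artanh}(\|Qx\|) &= \operatorname{artanh}(\|x\|).
\end{align*}
```

The second line follows because the radius depends on x only through its Euclidean norm, so the same argument applies to any convention that rescales artanh(‖x‖) by a constant.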
major comments (2)
- [§3 (Method)] The central claim that radius adjustment and the orthogonal transformation are decoupled and that the latter specifically resolves within-level semantic misalignment rests on the construction, but the manuscript supplies no explicit loss terms, optimization schedule, or small-scale derivation showing that angular alignment improves semantic metrics independently of radius changes. Without this, it is unclear whether the observed gains are attributable to the proposed decoupling rather than to generic fine-tuning.
- [§4 (Experiments)] The SOTA claim is load-bearing for the contribution, yet the manuscript provides no tables with per-dataset mIoU numbers, standard deviations across runs, or ablations that isolate the radius-adjustment component from the orthogonal component. This absence prevents verification that the decoupling, rather than other implementation choices, drives the reported improvements over prior hyperbolic baselines.
minor comments (2)
- [Abstract] The abstract states the SOTA result but omits the concrete benchmarks (e.g., ADE20K, Pascal-Context) and the magnitude of improvement; adding one quantitative sentence would strengthen the summary.
- [Notation] The hyperbolic radius is referred to interchangeably as “radius” and “hyperbolic radius”; a single consistent symbol (e.g., r or ρ) would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the theoretical strength of the norm-preservation argument. We respond to each major comment below and indicate the revisions we will undertake.
Point-by-point responses
- Referee: [§3 (Method)] The central claim that radius adjustment and the orthogonal transformation are decoupled and that the latter specifically resolves within-level semantic misalignment rests on the construction, but the manuscript supplies no explicit loss terms, optimization schedule, or small-scale derivation showing that angular alignment improves semantic metrics independently of radius changes. Without this, it is unclear whether the observed gains are attributable to the proposed decoupling rather than to generic fine-tuning.
  Authors: Section 3 presents the procedure as two sequential operations: radius scaling first aligns hierarchical levels by moving embeddings along radial lines in the Poincaré ball, after which an origin-centered orthogonal matrix is applied to rotate embeddings while exactly preserving their Euclidean norms and therefore their hyperbolic radii (via the identity artanh(‖Qx‖) = artanh(‖x‖) for orthogonal Q). The fine-tuning objective is a standard contrastive loss applied to the final embeddings. We agree that an explicit derivation isolating the angular-alignment effect would strengthen the decoupling claim. In the revision we will insert the precise loss formulation, the optimization schedule, and a short analytical example demonstrating that the orthogonal step changes only angular distances without altering radii.
  Revision: partial
- Referee: [§4 (Experiments)] The SOTA claim is load-bearing for the contribution, yet the manuscript provides no tables with per-dataset mIoU numbers, standard deviations across runs, or ablations that isolate the radius-adjustment component from the orthogonal component. This absence prevents verification that the decoupling, rather than other implementation choices, drives the reported improvements over prior hyperbolic baselines.
  Authors: We accept that the current experimental presentation is insufficient for independent verification. The manuscript reports aggregate benchmark scores and comparisons against prior methods, but does not include the requested per-dataset breakdowns, run-wise standard deviations, or component-wise ablations. In the revised version we will add comprehensive tables listing mIoU for each dataset, standard deviations over multiple random seeds, and ablation experiments that apply radius adjustment alone, orthogonal transformation alone, and both together, thereby isolating the contribution of the proposed decoupling.
  Revision: yes
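The two sequential operations described in the first response can be sketched on a single 2D point. This is an illustrative toy, not the authors' code: it assumes unit curvature with hyperbolic radius 2·artanh(‖x‖), and the function names `set_hyperbolic_radius` and `rotate_2d`, the target radius 1.5, and the angle 0.4 are all hypothetical choices.

```python
import math

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def hyperbolic_radius(x):
    """Distance from the origin in the unit-curvature Poincaré ball (assumed convention)."""
    return 2.0 * math.atanh(norm(x))

def set_hyperbolic_radius(x, target):
    """Move x along its radial line so that its hyperbolic radius equals target."""
    scale = math.tanh(target / 2.0) / norm(x)
    return [scale * v for v in x]

def rotate_2d(x, theta):
    """Origin-centered rotation, the simplest orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

def cosine(a, b):
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))

x = [0.12, 0.05]

# Step 1: radial move. The radius changes, the direction does not.
step1 = set_hyperbolic_radius(x, target=1.5)

# Step 2: rotation. The direction changes, the radius does not.
step2 = rotate_2d(step1, theta=0.4)

print(hyperbolic_radius(x), hyperbolic_radius(step1), hyperbolic_radius(step2))
print(cosine(x, step1), cosine(step1, step2))
```

In this toy setting the radial step leaves the cosine with the original direction at 1, and the rotation leaves the hyperbolic radius at the target value, which is the sense in which the two steps act on separate coordinates. It says nothing about how the learned losses interact during training, which is the point the referee raises.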
Circularity Check
No significant circularity detected in derivation chain
The paper introduces HyRo as an independent fine-tuning framework that decouples radius adjustment (for cross-level hierarchy) from origin-centered orthogonal transformation (for within-level angular semantics) inside the Poincaré ball. The stated property that orthogonal matrices preserve the hyperbolic radius follows directly from the standard Euclidean norm invariance ||Ox|| = ||x|| and the definition of hyperbolic radius via artanh, which is a pre-existing geometric fact rather than a result derived from the paper's own fitted parameters, self-citations, or input data. No equations, predictions, or uniqueness claims reduce by construction to the method's own outputs or prior self-referential work; the central proposal remains a self-contained architectural suggestion whose performance is assessed externally on benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Hyperbolic geometry can model hierarchical relationships in embedding spaces.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: dAlembert_cosh_solution_aczel (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius."
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "Since the Poincaré ball model is conformal, angles measured at the origin coincide with their Euclidean counterparts... ∥x′∥ = ∥Rx∥ = ∥x∥ (due to orthogonality), the hyperbolic radius Rad x′ = Rad x remains unchanged."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.