pith. sign in

arxiv: 2606.23843 · v1 · pith:3PULLWLZnew · submitted 2026-06-22 · 💻 cs.CV · cs.IR

HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

Pith reviewed 2026-06-26 08:47 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords vision-language modelsnegationhyperbolic embeddingsangular triplet lossCLIPembedding geometry
0
0 comments X

The pith

HANCLIP restructures vision-language embeddings with hyperbolic geometry and angular separation to encode negation without overwriting pretrained knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often fail on negated descriptions because they lean on word co-occurrence patterns. HANCLIP trains on a compact set of 20,000 image-text quadruplets that pair images with both positive and negated captions. It places representations in hyperbolic space to capture hierarchical and asymmetric relations, then applies an angular triplet objective to separate negated descriptions from their positives. The design improves results on negation-specific tests while leaving classification and retrieval performance intact or better. The same method attaches to several existing CLIP-style models without full retraining.

Core claim

HANCLIP restructures the embedding space of vision-language models with a hyperbolic formulation that models hierarchical semantic relations and asymmetries, together with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives, allowing the model to encode what an image is not alongside what it is using only 20,000 quadruplets.

What carries the argument

Hyperbolic embedding space combined with an angular triplet loss that separates negated from positive descriptions.

If this is right

  • Consistent gains appear on the negation-focused NegBench benchmark.
  • Performance stays competitive or improves on standard classification and image-text retrieval benchmarks.
  • The same training procedure attaches to CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted geometric objectives may prove more data-efficient than additional scale for adding specific reasoning skills to existing models.
  • The same hyperbolic-plus-angular pattern could extend to other logical constructs such as conjunction or quantification in vision-language settings.
  • Incremental geometric fine-tuning offers a route for extending model capabilities without full retraining cycles.

Load-bearing premise

The hyperbolic formulation and angular triplet objective strengthen negation sensitivity while preserving the global structure of pretrained representations rather than overwriting them.

What would settle it

A clear drop in performance on standard classification or retrieval benchmarks after the 20k-quadruplet training would show that the preservation of pretrained structure does not hold.

Figures

Figures reproduced from arXiv: 2606.23843 by Aiden Durrant, Binh T. Nguyen, Cathal Gurrin, Hoang-Bao Le, Liting Zhou, Thai Son Mai.

Figure 1
Figure 1. Figure 1: Overview of HANCLIP Framework with a quadruplet input of [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: We incorporate negation samples in Hyperbolic Contrastive Objective [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance of CLIP and LongCLIP on NegBench with different values [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: SBIR, CIR and IGROT performance across HANCLIP and CLIP ViT-L/14 based [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
read the original abstract

Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode "what an image is not" alongside "what it is." HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HANCLIP, a family of vision-language models that augment pretrained VLMs (CLIP, LongCLIP, etc.) with a hyperbolic embedding formulation and an angular triplet loss. Trained on only 20k image-text quadruplets, the method is claimed to improve negation sensitivity on NegBench while preserving or improving performance on standard classification and retrieval benchmarks, without the degradation typically seen from direct fine-tuning on negation data. The framework is presented as model-agnostic and geometry-aware.

Significance. If the central empirical claims hold under controlled conditions, the work would be significant for demonstrating that a compact, geometry-specific objective can address a known limitation (negation brittleness) in VLMs without large-scale retraining or catastrophic forgetting. The model-agnostic plug-in design and modest data requirement would be practically useful if the hyperbolic-plus-angular combination is shown to be necessary rather than incidental.

major comments (2)
  1. [Experiments] Experiments section: the central attribution—that the combination of hyperbolic geometry and angular triplet objective (rather than either component alone or standard fine-tuning) produces the reported NegBench gains while preserving global structure—requires controlled ablations. The manuscript reports only the joint HANCLIP model; no Euclidean angular-triplet baseline or hyperbolic contrastive-loss baseline on the identical 20k quadruplets and base models is described. This leaves the synergy claim untested and the preservation argument resting on absence of degradation rather than a direct demonstration.
  2. [§3] §3 (method) and Table X (results): the claim that the approach is 'parameter-free' or that the hyperbolic component specifically enables preservation of pretrained structure is not supported by any quantitative isolation of the curvature parameter or the angular margin; if these quantities are fitted or chosen post-hoc on the same data used for evaluation, the reported improvements risk circularity with respect to the NegBench test distribution.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'without large-scale retraining' is repeated but the exact training regime (optimizer, learning rate schedule, number of epochs, whether the backbone is frozen) is not stated; this detail belongs in §4 or the appendix for reproducibility.
  2. [§3] Notation: the definition of the angular triplet loss and its hyperbolic counterpart should be given explicitly with all symbols (e.g., margin, curvature) before the experimental claims; the current description is high-level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger attribution of the reported gains and clarification on parameter handling. We address each major comment below and commit to revisions that directly strengthen the empirical claims without overstating the current manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central attribution—that the combination of hyperbolic geometry and angular triplet objective (rather than either component alone or standard fine-tuning) produces the reported NegBench gains while preserving global structure—requires controlled ablations. The manuscript reports only the joint HANCLIP model; no Euclidean angular-triplet baseline or hyperbolic contrastive-loss baseline on the identical 20k quadruplets and base models is described. This leaves the synergy claim untested and the preservation argument resting on absence of degradation rather than a direct demonstration.

    Authors: We agree that the current manuscript lacks the requested controlled ablations, which limits the strength of the attribution to the specific combination of components. In the revised version we will add results for an Euclidean angular-triplet baseline and a hyperbolic contrastive-loss baseline, each trained on the exact same 20k quadruplets and base models (CLIP, LongCLIP, etc.). These additional experiments will allow direct comparison of the joint HANCLIP objective against its constituent parts and against standard fine-tuning, thereby testing the claimed synergy and providing positive evidence for preservation of global structure rather than relying solely on lack of degradation. revision: yes

  2. Referee: [§3] §3 (method) and Table X (results): the claim that the approach is 'parameter-free' or that the hyperbolic component specifically enables preservation of pretrained structure is not supported by any quantitative isolation of the curvature parameter or the angular margin; if these quantities are fitted or chosen post-hoc on the same data used for evaluation, the reported improvements risk circularity with respect to the NegBench test distribution.

    Authors: The manuscript does not claim the method is strictly parameter-free; the curvature is fixed at the conventional value of −1 and the angular margin is selected via a small held-out validation split drawn from the 20k quadruplets that does not overlap with NegBench. Nevertheless, we acknowledge that the current text provides insufficient quantitative isolation of these choices. In revision we will add a dedicated sensitivity analysis (new table or figure) reporting performance across a range of curvature values and margin settings on the validation split, together with explicit confirmation that no test-set information from NegBench was used for selection. This will remove any appearance of circularity and directly support the claim that the hyperbolic formulation aids preservation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training objective with no self-referential derivations

full rationale

The paper presents HANCLIP as an additive training framework (hyperbolic formulation + angular triplet objective on 20k quadruplets) that is plugged into existing VLMs. No equations, fitted parameters, or first-principles derivations appear in the provided text that would reduce any claimed result to a definition or input by construction. Performance gains on NegBench and preservation on standard benchmarks are reported as experimental outcomes rather than mathematical identities. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in a way that collapses the central claim. The derivation chain is therefore self-contained as a new objective function whose effects are measured externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the approach rests on standard assumptions that hyperbolic space captures hierarchical relations and that angular separation in embedding space corresponds to semantic negation. No free parameters or invented entities are named.

axioms (2)
  • domain assumption Hyperbolic geometry models hierarchical semantic relations and asymmetries in vision-language embeddings.
    Invoked to justify the choice of hyperbolic formulation over Euclidean space.
  • domain assumption An angular triplet objective produces systematic separation between negated descriptions and positives without destroying pretrained global structure.
    Central premise that the added loss improves negation while preserving other capabilities.

pith-pipeline@v0.9.1-grok · 5829 in / 1389 out tokens · 21291 ms · 2026-06-26T08:47:54.697199+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 3 canonical work pages

  1. [1]

    iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval .IEEE Transactions on Pattern Analysis & Machine Intelligence, 47, 2025

    Lorenzo Agnolucci, Alberto Baldrati, Alberto Del Bimbo, and Marco Bertini. iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval .IEEE Transactions on Pattern Analysis & Machine Intelligence, 47, 2025. doi: 10.1109/TPAMI.2025.3593539

  2. [2]

    Vision-language models do not understand nega- tion

    Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand nega- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29612–29622, 2025

  3. [3]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

  4. [4]

    Hyperbolic image-text representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shan- mukha Ramakrishna Vedantam. Hyperbolic image-text representations. InInterna- tional Conference on Machine Learning, pages 7694–7731. PMLR, 2023

  5. [5]

    Doodle to search: Practical zero-shot sketch-based image retrieval

    Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, and Yi-Zhe Song. Doodle to search: Practical zero-shot sketch-based image retrieval. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 2179–2188, 2019

  6. [6]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- 16PREPRINT VERSION: vain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020

  7. [7]

    Teaching structured vision & language concepts to vision & language models

    Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, and Leonid Karlinsky. Teaching structured vision & language concepts to vision & language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2657– 2668, 2023

  8. [8]

    Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

    Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

  9. [9]

    Hyperbolic vision transformers: Combining improvements in metric learning

    Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan Os- eledets. Hyperbolic vision transformers: Combining improvements in metric learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 7409–7419, 2022

  10. [10]

    Mitigate the gap: Improving cross-modal align- ment in CLIP

    Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal align- ment in CLIP. InThe Thirteenth International Conference on Learning Representa- tions, 2025. URLhttps://openreview.net/forum?id=aPTGvFqile

  11. [11]

    SimCSE: Simple contrastive learn- ing of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learn- ing of sentence embeddings. InEmpirical Methods in Natural Language Processing (EMNLP), pages 6894–6910, Online and Punta Cana, Dominican Republic, 2021. As- sociation for Computational Linguistics

  12. [12]

    Cyclip: Cyclic contrastive language-image pretraining

    Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. Cyclip: Cyclic contrastive language-image pretraining. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 6704–6719, 2022

  13. [13]

    Negation-aware test- time adaptation for vision-language models.arXiv preprint arXiv:2507.19064, 2025

    Haochen Han, Alex Jinpeng Wang, Fangming Liu, and Jun Zhu. Negation-aware test- time adaptation for vision-language models.arXiv preprint arXiv:2507.19064, 2025

  14. [14]

    Nguyen Binh, Zhou Liting, and Gurrin Cathal

    Le Hoang-Bao, Tran Allie, T. Nguyen Binh, Zhou Liting, and Gurrin Cathal. Union: A lightweight target representation for efficient image-guided retrieval with optional textual queries. In2025 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2025

  15. [15]

    Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  16. [16]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. PREPRINT VERSION:17

  17. [17]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

  18. [18]

    Learning multiple layers of features from tiny images.(2009), 2009

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.(2009), 2009

  19. [19]

    Nguyen, Liting Zhou, and Cathal Gurrin

    Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, and Cathal Gurrin. Figrotd: A friendly-to-handle dataset for image guided retrieval with optional text. InMultiMedia Modeling, pages 117–132, Singapore, 2026. Springer Nature Singapore

  20. [20]

    Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InICML, 2022

  21. [21]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  22. [22]

    Lawrence Zitnick, and Piotr Dollár

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. URLhttps://arxiv.org/ abs/1405.0312

  23. [23]

    Deep sketch hash- ing: Fast free-hand sketch-based image retrieval

    Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. Deep sketch hash- ing: Fast free-hand sketch-based image retrieval. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 2862–2871, 2017

  24. [24]

    C-clip: Multimodal continual learning for vision-language model

    Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision-language model. InThe Thirteenth International Conference on Learning Representations, 2025

  25. [25]

    Zero- shot composed text-image retrieval

    Yikun Liu, Jiangchao Yao, Ya Zhang, Yan-Feng Wang, and Weidi Xie. Zero- shot composed text-image retrieval. InBMVC, page 381, 2023. URLhttp: //proceedings.bmvc2023.org/381/

  26. [26]

    Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  27. [27]

    Compositional entailment learning for hyperbolic vision-language models

    Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Fla- borea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025. URL https://openreview.net/forum?id=3i13Gev2hV

  28. [28]

    Cross- domain adversarial feature learning for sketch re-identification

    Lu Pang, Yaowei Wang, Yi-Zhe Song, Tiejun Huang, and Yonghong Tian. Cross- domain adversarial feature learning for sketch re-identification. InProceedings of the 26th ACM international conference on Multimedia, pages 609–617, 2018. 18PREPRINT VERSION:

  29. [29]

    Know ”no” better: A data-driven approach for enhancing negation awareness in clip

    Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know ”no” better: A data-driven approach for enhancing negation awareness in clip. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, October 2025

  30. [30]

    Ins- detclip: Aligning detection model to follow human-language instruction

    Renjie Pi, Lewei Yao, Jianhua Han, Xiaodan Liang, Wei Zhang, and Hang Xu. Ins- detclip: Aligning detection model to follow human-language instruction. InThe Twelfth International Conference on Learning Representations, 2024

  31. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational confer- ence on machine learning, pages 8748–8763. PMLR, 2021

  32. [32]

    Imagenet large scale visual recognition challenge.International journal of computer vision, 115 (3):211–252, 2015

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115 (3):211–252, 2015

  33. [33]

    Learning the power of ”no”: Foundation models with negations

    Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learning the power of ”no”: Foundation models with negations. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 7991–8001, February 2025

  34. [34]

    Hypervlm: Hyperbolic space guided vision language modeling for hierarchical multi-modal understanding

    Sarthak Srivastava and Kathy Wu. Hypervlm: Hyperbolic space guided vision language modeling for hierarchical multi-modal understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2347–2358, 2025

  35. [35]

    In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13019–13029, 2024. doi: 10.1109/CVPR52733.2024.01237

  36. [36]

    On the brittleness of clip text encoders.arXiv preprint arXiv:2511.04247, 2025

    Allie Tran and Luca Rossetto. On the brittleness of clip text encoders.arXiv preprint arXiv:2511.04247, 2025

  37. [37]

    Siglip 2: Multilingual vision-language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understand- ing, localization, and dense features.a...

  38. [38]

    Springer Na- ture, 2022

    Abraham Ungar.A gyrovector space approach to hyperbolic geometry. Springer Na- ture, 2022

  39. [39]

    The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback.CVPR, 2021

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grau- man, and Rogerio Feris. The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback.CVPR, 2021

  40. [40]

    Himo-clip: Modeling semantic hierarchy and monotonicity in vision-language alignment

    Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, et al. Himo-clip: Modeling semantic hierarchy and monotonicity in vision-language alignment. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 26974–26982, 2026. PREPRINT VERSION:19

  41. [41]

    Smartclip: Modular vision-language alignment with identifi- cation guarantees

    Shaoan Xie, Lingjing Lingjing, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P Xing, Guangyi Chen, and Kun Zhang. Smartclip: Modular vision-language alignment with identifi- cation guarantees. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29780–29790, 2025

  42. [42]

    Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection.Advances in Neural Information Processing Systems, 35:9125–9138, 2022

    Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection.Advances in Neural Information Processing Systems, 35:9125–9138, 2022

  43. [43]

    Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment

    Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23497–23506, 2023

  44. [44]

    Detclipv3: Towards versatile generative open-vocabulary object de- tection

    Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. Detclipv3: Towards versatile generative open-vocabulary object de- tection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27391–27401, 2024

  45. [45]

    A zero-shot framework for sketch based image retrieval

    Sasi Kiran Yelamarthi, Shiva Krishna Reddy, Ashish Mishra, and Anurag Mittal. A zero-shot framework for sketch based image retrieval. InProceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018

  46. [46]

    Phyclip:ℓ 1-product of hyperbolic fac- tors unifies hierarchy and compositionality in vision-language representation learning,

    Daiki Yoshikawa and Takashi Matsubara. Phyclip:ℓ 1-product of hyperbolic fac- tors unifies hierarchy and compositionality in vision-language representation learning,

  47. [47]

    URLhttps://arxiv.org/abs/2510.08919

  48. [48]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descrip- tions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

  49. [49]

    When and why vision-language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations,

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations,

  50. [50]

    URLhttps://openreview.net/forum?id=KRLUvxh8uaX

  51. [51]

    Low-rank few-shot adaptation of vision- language models

    Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision- language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1593–1603, 2024

  52. [52]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 11975–11986, 2023

  53. [53]

    Long-clip: Unlocking the long-text capability of clip

    Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. InEuropean conference on computer vision, pages 310–325. Springer, 2024. 20PREPRINT VERSION:

  54. [54]

    Sketchnet: Sketch classification with web images

    Hua Zhang, Si Liu, Changqing Zhang, Wenqi Ren, Rui Wang, and Xiaochun Cao. Sketchnet: Sketch classification with web images. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 1105–1113, 2016

  55. [55]

    Magiclens: Self-supervised image retrieval with open-ended instructions

    Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. InThe Forty-first International Conference on Machine Learning (ICML), page to appear, 2024

  56. [56]

    In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13774–13784, 2024. doi: 10.1109/CVPR52733.2024.01307

  57. [57]

    Rankclip: Ranking-consistent language-image pretraining

    Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, and Yining Sun. Rankclip: Ranking-consistent language-image pretraining. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3874–3884, October 2025. 6 Ablation Studies (Cont.) 6.1 Impact of Weightα 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 342.5 345.0 34...