HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

Aiden Durrant; Binh T. Nguyen; Cathal Gurrin; Hoang-Bao Le; Liting Zhou; Thai Son Mai

arxiv: 2606.23843 · v1 · pith:3PULLWLZnew · submitted 2026-06-22 · 💻 cs.CV · cs.IR

HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

Hoang-Bao Le , Aiden Durrant , Thai Son Mai , Binh T. Nguyen , Liting Zhou , Cathal Gurrin This is my paper

Pith reviewed 2026-06-26 08:47 UTC · model grok-4.3

classification 💻 cs.CV cs.IR

keywords vision-language modelsnegationhyperbolic embeddingsangular triplet lossCLIPembedding geometry

0 comments

The pith

HANCLIP restructures vision-language embeddings with hyperbolic geometry and angular separation to encode negation without overwriting pretrained knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often fail on negated descriptions because they lean on word co-occurrence patterns. HANCLIP trains on a compact set of 20,000 image-text quadruplets that pair images with both positive and negated captions. It places representations in hyperbolic space to capture hierarchical and asymmetric relations, then applies an angular triplet objective to separate negated descriptions from their positives. The design improves results on negation-specific tests while leaving classification and retrieval performance intact or better. The same method attaches to several existing CLIP-style models without full retraining.

Core claim

HANCLIP restructures the embedding space of vision-language models with a hyperbolic formulation that models hierarchical semantic relations and asymmetries, together with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives, allowing the model to encode what an image is not alongside what it is using only 20,000 quadruplets.

What carries the argument

Hyperbolic embedding space combined with an angular triplet loss that separates negated from positive descriptions.

If this is right

Consistent gains appear on the negation-focused NegBench benchmark.
Performance stays competitive or improves on standard classification and image-text retrieval benchmarks.
The same training procedure attaches to CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Targeted geometric objectives may prove more data-efficient than additional scale for adding specific reasoning skills to existing models.
The same hyperbolic-plus-angular pattern could extend to other logical constructs such as conjunction or quantification in vision-language settings.
Incremental geometric fine-tuning offers a route for extending model capabilities without full retraining cycles.

Load-bearing premise

The hyperbolic formulation and angular triplet objective strengthen negation sensitivity while preserving the global structure of pretrained representations rather than overwriting them.

What would settle it

A clear drop in performance on standard classification or retrieval benchmarks after the 20k-quadruplet training would show that the preservation of pretrained structure does not hold.

Figures

Figures reproduced from arXiv: 2606.23843 by Aiden Durrant, Binh T. Nguyen, Cathal Gurrin, Hoang-Bao Le, Liting Zhou, Thai Son Mai.

**Figure 2.** Figure 2: We incorporate negation samples in Hyperbolic Contrastive Objective [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Performance of CLIP and LongCLIP on NegBench with different values [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

**Figure 4.** Figure 4: SBIR, CIR and IGROT performance across HANCLIP and CLIP ViT-L/14 based [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗

read the original abstract

Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode "what an image is not" alongside "what it is." HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HANCLIP pairs hyperbolic embeddings with an angular triplet loss to improve negation in VLMs on 20k examples, but lacks the ablations needed to show the combination is what drives the gains.

read the letter

The paper's main contribution is a training objective that puts VLMs into hyperbolic space and adds an angular triplet term so the model learns to separate negated descriptions from positives. They apply this on top of several existing models using only 20,000 quadruplets and report better results on NegBench while holding or improving standard classification and retrieval numbers.

The combination itself looks new relative to the cited work, and the authors correctly flag that plain fine-tuning on negation data tends to hurt other capabilities. Framing the fix as a lightweight geometric change rather than full retraining is a practical angle.

The clearest weakness is the missing ablations. The central claim is that the hyperbolic-plus-angular design produces the negation lift without overwriting the original representation; yet the abstract gives no Euclidean angular-triplet baseline or hyperbolic contrastive baseline on the same data and models. Without those controls it is hard to know whether the reported improvement comes from the specific geometry or simply from seeing negation examples at all. The preservation claim also rests on the absence of degradation rather than a direct demonstration that the geometry is what protects the global structure.

This is the kind of paper that would interest people working on geometric modifications to multimodal embeddings or on making VLMs more logically robust. A reader who already cares about negation benchmarks or hyperbolic methods could extract something useful if the full experiments are clean.

I would send it to peer review. The problem is real, the proposed fix is coherent on its own terms, and the main open question (whether the two components are jointly necessary) is exactly the sort of thing referees can press on with the full tables and code.

Referee Report

2 major / 2 minor

Summary. The paper introduces HANCLIP, a family of vision-language models that augment pretrained VLMs (CLIP, LongCLIP, etc.) with a hyperbolic embedding formulation and an angular triplet loss. Trained on only 20k image-text quadruplets, the method is claimed to improve negation sensitivity on NegBench while preserving or improving performance on standard classification and retrieval benchmarks, without the degradation typically seen from direct fine-tuning on negation data. The framework is presented as model-agnostic and geometry-aware.

Significance. If the central empirical claims hold under controlled conditions, the work would be significant for demonstrating that a compact, geometry-specific objective can address a known limitation (negation brittleness) in VLMs without large-scale retraining or catastrophic forgetting. The model-agnostic plug-in design and modest data requirement would be practically useful if the hyperbolic-plus-angular combination is shown to be necessary rather than incidental.

major comments (2)

[Experiments] Experiments section: the central attribution—that the combination of hyperbolic geometry and angular triplet objective (rather than either component alone or standard fine-tuning) produces the reported NegBench gains while preserving global structure—requires controlled ablations. The manuscript reports only the joint HANCLIP model; no Euclidean angular-triplet baseline or hyperbolic contrastive-loss baseline on the identical 20k quadruplets and base models is described. This leaves the synergy claim untested and the preservation argument resting on absence of degradation rather than a direct demonstration.
[§3] §3 (method) and Table X (results): the claim that the approach is 'parameter-free' or that the hyperbolic component specifically enables preservation of pretrained structure is not supported by any quantitative isolation of the curvature parameter or the angular margin; if these quantities are fitted or chosen post-hoc on the same data used for evaluation, the reported improvements risk circularity with respect to the NegBench test distribution.

minor comments (2)

[Abstract] Abstract: the phrase 'without large-scale retraining' is repeated but the exact training regime (optimizer, learning rate schedule, number of epochs, whether the backbone is frozen) is not stated; this detail belongs in §4 or the appendix for reproducibility.
[§3] Notation: the definition of the angular triplet loss and its hyperbolic counterpart should be given explicitly with all symbols (e.g., margin, curvature) before the experimental claims; the current description is high-level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger attribution of the reported gains and clarification on parameter handling. We address each major comment below and commit to revisions that directly strengthen the empirical claims without overstating the current manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: the central attribution—that the combination of hyperbolic geometry and angular triplet objective (rather than either component alone or standard fine-tuning) produces the reported NegBench gains while preserving global structure—requires controlled ablations. The manuscript reports only the joint HANCLIP model; no Euclidean angular-triplet baseline or hyperbolic contrastive-loss baseline on the identical 20k quadruplets and base models is described. This leaves the synergy claim untested and the preservation argument resting on absence of degradation rather than a direct demonstration.

Authors: We agree that the current manuscript lacks the requested controlled ablations, which limits the strength of the attribution to the specific combination of components. In the revised version we will add results for an Euclidean angular-triplet baseline and a hyperbolic contrastive-loss baseline, each trained on the exact same 20k quadruplets and base models (CLIP, LongCLIP, etc.). These additional experiments will allow direct comparison of the joint HANCLIP objective against its constituent parts and against standard fine-tuning, thereby testing the claimed synergy and providing positive evidence for preservation of global structure rather than relying solely on lack of degradation. revision: yes
Referee: [§3] §3 (method) and Table X (results): the claim that the approach is 'parameter-free' or that the hyperbolic component specifically enables preservation of pretrained structure is not supported by any quantitative isolation of the curvature parameter or the angular margin; if these quantities are fitted or chosen post-hoc on the same data used for evaluation, the reported improvements risk circularity with respect to the NegBench test distribution.

Authors: The manuscript does not claim the method is strictly parameter-free; the curvature is fixed at the conventional value of −1 and the angular margin is selected via a small held-out validation split drawn from the 20k quadruplets that does not overlap with NegBench. Nevertheless, we acknowledge that the current text provides insufficient quantitative isolation of these choices. In revision we will add a dedicated sensitivity analysis (new table or figure) reporting performance across a range of curvature values and margin settings on the validation split, together with explicit confirmation that no test-set information from NegBench was used for selection. This will remove any appearance of circularity and directly support the claim that the hyperbolic formulation aids preservation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training objective with no self-referential derivations

full rationale

The paper presents HANCLIP as an additive training framework (hyperbolic formulation + angular triplet objective on 20k quadruplets) that is plugged into existing VLMs. No equations, fitted parameters, or first-principles derivations appear in the provided text that would reduce any claimed result to a definition or input by construction. Performance gains on NegBench and preservation on standard benchmarks are reported as experimental outcomes rather than mathematical identities. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in a way that collapses the central claim. The derivation chain is therefore self-contained as a new objective function whose effects are measured externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the approach rests on standard assumptions that hyperbolic space captures hierarchical relations and that angular separation in embedding space corresponds to semantic negation. No free parameters or invented entities are named.

axioms (2)

domain assumption Hyperbolic geometry models hierarchical semantic relations and asymmetries in vision-language embeddings.
Invoked to justify the choice of hyperbolic formulation over Euclidean space.
domain assumption An angular triplet objective produces systematic separation between negated descriptions and positives without destroying pretrained global structure.
Central premise that the added loss improves negation while preserving other capabilities.

pith-pipeline@v0.9.1-grok · 5829 in / 1389 out tokens · 21291 ms · 2026-06-26T08:47:54.697199+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 3 canonical work pages

[1]

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval .IEEE Transactions on Pattern Analysis & Machine Intelligence, 47, 2025

Lorenzo Agnolucci, Alberto Baldrati, Alberto Del Bimbo, and Marco Bertini. iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval .IEEE Transactions on Pattern Analysis & Machine Intelligence, 47, 2025. doi: 10.1109/TPAMI.2025.3593539

work page doi:10.1109/tpami.2025.3593539 2025
[2]

Vision-language models do not understand nega- tion

Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand nega- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29612–29622, 2025

2025
[3]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

2024
[4]

Hyperbolic image-text representations

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shan- mukha Ramakrishna Vedantam. Hyperbolic image-text representations. InInterna- tional Conference on Machine Learning, pages 7694–7731. PMLR, 2023

2023
[5]

Doodle to search: Practical zero-shot sketch-based image retrieval

Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, and Yi-Zhe Song. Doodle to search: Practical zero-shot sketch-based image retrieval. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 2179–2188, 2019

2019
[6]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- 16PREPRINT VERSION: vain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020

2020
[7]

Teaching structured vision & language concepts to vision & language models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, and Leonid Karlinsky. Teaching structured vision & language concepts to vision & language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2657– 2668, 2023

2023
[8]

Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

2024
[9]

Hyperbolic vision transformers: Combining improvements in metric learning

Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan Os- eledets. Hyperbolic vision transformers: Combining improvements in metric learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 7409–7419, 2022

2022
[10]

Mitigate the gap: Improving cross-modal align- ment in CLIP

Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal align- ment in CLIP. InThe Thirteenth International Conference on Learning Representa- tions, 2025. URLhttps://openreview.net/forum?id=aPTGvFqile

2025
[11]

SimCSE: Simple contrastive learn- ing of sentence embeddings

Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learn- ing of sentence embeddings. InEmpirical Methods in Natural Language Processing (EMNLP), pages 6894–6910, Online and Punta Cana, Dominican Republic, 2021. As- sociation for Computational Linguistics

2021
[12]

Cyclip: Cyclic contrastive language-image pretraining

Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. Cyclip: Cyclic contrastive language-image pretraining. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 6704–6719, 2022

2022
[13]

Negation-aware test- time adaptation for vision-language models.arXiv preprint arXiv:2507.19064, 2025

Haochen Han, Alex Jinpeng Wang, Fangming Liu, and Jun Zhu. Negation-aware test- time adaptation for vision-language models.arXiv preprint arXiv:2507.19064, 2025

arXiv 2025
[14]

Nguyen Binh, Zhou Liting, and Gurrin Cathal

Le Hoang-Bao, Tran Allie, T. Nguyen Binh, Zhou Liting, and Gurrin Cathal. Union: A lightweight target representation for efficient image-guided retrieval with optional textual queries. In2025 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2025

2025
[15]

Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

2023
[16]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. PREPRINT VERSION:17

2022
[17]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

2021
[18]

Learning multiple layers of features from tiny images.(2009), 2009

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.(2009), 2009

2009
[19]

Nguyen, Liting Zhou, and Cathal Gurrin

Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, and Cathal Gurrin. Figrotd: A friendly-to-handle dataset for image guided retrieval with optional text. InMultiMedia Modeling, pages 117–132, Singapore, 2026. Springer Nature Singapore

2026
[20]

Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InICML, 2022

2022
[21]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[22]

Lawrence Zitnick, and Piotr Dollár

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. URLhttps://arxiv.org/ abs/1405.0312

Pith/arXiv arXiv 2015
[23]

Deep sketch hash- ing: Fast free-hand sketch-based image retrieval

Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. Deep sketch hash- ing: Fast free-hand sketch-based image retrieval. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 2862–2871, 2017

2017
[24]

C-clip: Multimodal continual learning for vision-language model

Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision-language model. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[25]

Zero- shot composed text-image retrieval

Yikun Liu, Jiangchao Yao, Ya Zhang, Yan-Feng Wang, and Weidi Xie. Zero- shot composed text-image retrieval. InBMVC, page 381, 2023. URLhttp: //proceedings.bmvc2023.org/381/

2023
[26]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017
[27]

Compositional entailment learning for hyperbolic vision-language models

Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Fla- borea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025. URL https://openreview.net/forum?id=3i13Gev2hV

2025
[28]

Cross- domain adversarial feature learning for sketch re-identification

Lu Pang, Yaowei Wang, Yi-Zhe Song, Tiejun Huang, and Yonghong Tian. Cross- domain adversarial feature learning for sketch re-identification. InProceedings of the 26th ACM international conference on Multimedia, pages 609–617, 2018. 18PREPRINT VERSION:

2018
[29]

Know ”no” better: A data-driven approach for enhancing negation awareness in clip

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know ”no” better: A data-driven approach for enhancing negation awareness in clip. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, October 2025

2025
[30]

Ins- detclip: Aligning detection model to follow human-language instruction

Renjie Pi, Lewei Yao, Jianhua Han, Xiaodan Liang, Wei Zhang, and Hang Xu. Ins- detclip: Aligning detection model to follow human-language instruction. InThe Twelfth International Conference on Learning Representations, 2024

2024
[31]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational confer- ence on machine learning, pages 8748–8763. PMLR, 2021

2021
[32]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115 (3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115 (3):211–252, 2015

2015
[33]

Learning the power of ”no”: Foundation models with negations

Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learning the power of ”no”: Foundation models with negations. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 7991–8001, February 2025

2025
[34]

Hypervlm: Hyperbolic space guided vision language modeling for hierarchical multi-modal understanding

Sarthak Srivastava and Kathy Wu. Hypervlm: Hyperbolic space guided vision language modeling for hierarchical multi-modal understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2347–2358, 2025

2025
[35]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13019–13029, 2024. doi: 10.1109/CVPR52733.2024.01237

work page doi:10.1109/cvpr52733.2024.01237 2024
[36]

On the brittleness of clip text encoders.arXiv preprint arXiv:2511.04247, 2025

Allie Tran and Luca Rossetto. On the brittleness of clip text encoders.arXiv preprint arXiv:2511.04247, 2025

arXiv 2025
[37]

Siglip 2: Multilingual vision-language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understand- ing, localization, and dense features.a...

Pith/arXiv arXiv 2025
[38]

Springer Na- ture, 2022

Abraham Ungar.A gyrovector space approach to hyperbolic geometry. Springer Na- ture, 2022

2022
[39]

The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback.CVPR, 2021

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grau- man, and Rogerio Feris. The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback.CVPR, 2021

2021
[40]

Himo-clip: Modeling semantic hierarchy and monotonicity in vision-language alignment

Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, et al. Himo-clip: Modeling semantic hierarchy and monotonicity in vision-language alignment. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 26974–26982, 2026. PREPRINT VERSION:19

2026
[41]

Smartclip: Modular vision-language alignment with identifi- cation guarantees

Shaoan Xie, Lingjing Lingjing, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P Xing, Guangyi Chen, and Kun Zhang. Smartclip: Modular vision-language alignment with identifi- cation guarantees. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29780–29790, 2025

2025
[42]

Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection.Advances in Neural Information Processing Systems, 35:9125–9138, 2022

Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection.Advances in Neural Information Processing Systems, 35:9125–9138, 2022

2022
[43]

Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment

Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23497–23506, 2023

2023
[44]

Detclipv3: Towards versatile generative open-vocabulary object de- tection

Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. Detclipv3: Towards versatile generative open-vocabulary object de- tection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27391–27401, 2024

2024
[45]

A zero-shot framework for sketch based image retrieval

Sasi Kiran Yelamarthi, Shiva Krishna Reddy, Ashish Mishra, and Anurag Mittal. A zero-shot framework for sketch based image retrieval. InProceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018

2018
[46]

Phyclip:ℓ 1-product of hyperbolic fac- tors unifies hierarchy and compositionality in vision-language representation learning,

Daiki Yoshikawa and Takashi Matsubara. Phyclip:ℓ 1-product of hyperbolic fac- tors unifies hierarchy and compositionality in vision-language representation learning,
[47]

URLhttps://arxiv.org/abs/2510.08919

arXiv
[48]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descrip- tions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

2014
[49]

When and why vision-language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations,

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations,
[50]

URLhttps://openreview.net/forum?id=KRLUvxh8uaX
[51]

Low-rank few-shot adaptation of vision- language models

Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision- language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1593–1603, 2024

2024
[52]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 11975–11986, 2023

2023
[53]

Long-clip: Unlocking the long-text capability of clip

Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. InEuropean conference on computer vision, pages 310–325. Springer, 2024. 20PREPRINT VERSION:

2024
[54]

Sketchnet: Sketch classification with web images

Hua Zhang, Si Liu, Changqing Zhang, Wenqi Ren, Rui Wang, and Xiaochun Cao. Sketchnet: Sketch classification with web images. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 1105–1113, 2016

2016
[55]

Magiclens: Self-supervised image retrieval with open-ended instructions

Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. InThe Forty-first International Conference on Machine Learning (ICML), page to appear, 2024

2024
[56]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13774–13784, 2024. doi: 10.1109/CVPR52733.2024.01307

work page doi:10.1109/cvpr52733.2024.01307 2024
[57]

Rankclip: Ranking-consistent language-image pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, and Yining Sun. Rankclip: Ranking-consistent language-image pretraining. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3874–3884, October 2025. 6 Ablation Studies (Cont.) 6.1 Impact of Weightα 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 342.5 345.0 34...

2025

[1] [1]

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval .IEEE Transactions on Pattern Analysis & Machine Intelligence, 47, 2025

Lorenzo Agnolucci, Alberto Baldrati, Alberto Del Bimbo, and Marco Bertini. iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval .IEEE Transactions on Pattern Analysis & Machine Intelligence, 47, 2025. doi: 10.1109/TPAMI.2025.3593539

work page doi:10.1109/tpami.2025.3593539 2025

[2] [2]

Vision-language models do not understand nega- tion

Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand nega- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29612–29622, 2025

2025

[3] [3]

Sharegpt4v: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024

2024

[4] [4]

Hyperbolic image-text representations

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shan- mukha Ramakrishna Vedantam. Hyperbolic image-text representations. InInterna- tional Conference on Machine Learning, pages 7694–7731. PMLR, 2023

2023

[5] [5]

Doodle to search: Practical zero-shot sketch-based image retrieval

Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, and Yi-Zhe Song. Doodle to search: Practical zero-shot sketch-based image retrieval. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 2179–2188, 2019

2019

[6] [6]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- 16PREPRINT VERSION: vain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020

2020

[7] [7]

Teaching structured vision & language concepts to vision & language models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, and Leonid Karlinsky. Teaching structured vision & language concepts to vision & language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2657– 2668, 2023

2023

[8] [8]

Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

2024

[9] [9]

Hyperbolic vision transformers: Combining improvements in metric learning

Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan Os- eledets. Hyperbolic vision transformers: Combining improvements in metric learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 7409–7419, 2022

2022

[10] [10]

Mitigate the gap: Improving cross-modal align- ment in CLIP

Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal align- ment in CLIP. InThe Thirteenth International Conference on Learning Representa- tions, 2025. URLhttps://openreview.net/forum?id=aPTGvFqile

2025

[11] [11]

SimCSE: Simple contrastive learn- ing of sentence embeddings

Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learn- ing of sentence embeddings. InEmpirical Methods in Natural Language Processing (EMNLP), pages 6894–6910, Online and Punta Cana, Dominican Republic, 2021. As- sociation for Computational Linguistics

2021

[12] [12]

Cyclip: Cyclic contrastive language-image pretraining

Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. Cyclip: Cyclic contrastive language-image pretraining. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 6704–6719, 2022

2022

[13] [13]

Negation-aware test- time adaptation for vision-language models.arXiv preprint arXiv:2507.19064, 2025

Haochen Han, Alex Jinpeng Wang, Fangming Liu, and Jun Zhu. Negation-aware test- time adaptation for vision-language models.arXiv preprint arXiv:2507.19064, 2025

arXiv 2025

[14] [14]

Nguyen Binh, Zhou Liting, and Gurrin Cathal

Le Hoang-Bao, Tran Allie, T. Nguyen Binh, Zhou Liting, and Gurrin Cathal. Union: A lightweight target representation for efficient image-guided retrieval with optional textual queries. In2025 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2025

2025

[15] [15]

Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

2023

[16] [16]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. PREPRINT VERSION:17

2022

[17] [17]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

2021

[18] [18]

Learning multiple layers of features from tiny images.(2009), 2009

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.(2009), 2009

2009

[19] [19]

Nguyen, Liting Zhou, and Cathal Gurrin

Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, and Cathal Gurrin. Figrotd: A friendly-to-handle dataset for image guided retrieval with optional text. InMultiMedia Modeling, pages 117–132, Singapore, 2026. Springer Nature Singapore

2026

[20] [20]

Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InICML, 2022

2022

[21] [21]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023

[22] [22]

Lawrence Zitnick, and Piotr Dollár

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. URLhttps://arxiv.org/ abs/1405.0312

Pith/arXiv arXiv 2015

[23] [23]

Deep sketch hash- ing: Fast free-hand sketch-based image retrieval

Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. Deep sketch hash- ing: Fast free-hand sketch-based image retrieval. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 2862–2871, 2017

2017

[24] [24]

C-clip: Multimodal continual learning for vision-language model

Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision-language model. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[25] [25]

Zero- shot composed text-image retrieval

Yikun Liu, Jiangchao Yao, Ya Zhang, Yan-Feng Wang, and Weidi Xie. Zero- shot composed text-image retrieval. InBMVC, page 381, 2023. URLhttp: //proceedings.bmvc2023.org/381/

2023

[26] [26]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017

[27] [27]

Compositional entailment learning for hyperbolic vision-language models

Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Fla- borea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025. URL https://openreview.net/forum?id=3i13Gev2hV

2025

[28] [28]

Cross- domain adversarial feature learning for sketch re-identification

Lu Pang, Yaowei Wang, Yi-Zhe Song, Tiejun Huang, and Yonghong Tian. Cross- domain adversarial feature learning for sketch re-identification. InProceedings of the 26th ACM international conference on Multimedia, pages 609–617, 2018. 18PREPRINT VERSION:

2018

[29] [29]

Know ”no” better: A data-driven approach for enhancing negation awareness in clip

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know ”no” better: A data-driven approach for enhancing negation awareness in clip. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, October 2025

2025

[30] [30]

Ins- detclip: Aligning detection model to follow human-language instruction

Renjie Pi, Lewei Yao, Jianhua Han, Xiaodan Liang, Wei Zhang, and Hang Xu. Ins- detclip: Aligning detection model to follow human-language instruction. InThe Twelfth International Conference on Learning Representations, 2024

2024

[31] [31]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational confer- ence on machine learning, pages 8748–8763. PMLR, 2021

2021

[32] [32]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115 (3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115 (3):211–252, 2015

2015

[33] [33]

Learning the power of ”no”: Foundation models with negations

Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learning the power of ”no”: Foundation models with negations. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 7991–8001, February 2025

2025

[34] [34]

Hypervlm: Hyperbolic space guided vision language modeling for hierarchical multi-modal understanding

Sarthak Srivastava and Kathy Wu. Hypervlm: Hyperbolic space guided vision language modeling for hierarchical multi-modal understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2347–2358, 2025

2025

[35] [35]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13019–13029, 2024. doi: 10.1109/CVPR52733.2024.01237

work page doi:10.1109/cvpr52733.2024.01237 2024

[36] [36]

On the brittleness of clip text encoders.arXiv preprint arXiv:2511.04247, 2025

Allie Tran and Luca Rossetto. On the brittleness of clip text encoders.arXiv preprint arXiv:2511.04247, 2025

arXiv 2025

[37] [37]

Siglip 2: Multilingual vision-language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understand- ing, localization, and dense features.a...

Pith/arXiv arXiv 2025

[38] [38]

Springer Na- ture, 2022

Abraham Ungar.A gyrovector space approach to hyperbolic geometry. Springer Na- ture, 2022

2022

[39] [39]

The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback.CVPR, 2021

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grau- man, and Rogerio Feris. The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback.CVPR, 2021

2021

[40] [40]

Himo-clip: Modeling semantic hierarchy and monotonicity in vision-language alignment

Ruijia Wu, Ping Chen, Fei Shen, Shaoan Zhao, Qiang Hui, Huanlin Gao, Ting Lu, Zhaoxiang Liu, Fang Zhao, Kai Wang, et al. Himo-clip: Modeling semantic hierarchy and monotonicity in vision-language alignment. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 26974–26982, 2026. PREPRINT VERSION:19

2026

[41] [41]

Smartclip: Modular vision-language alignment with identifi- cation guarantees

Shaoan Xie, Lingjing Lingjing, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P Xing, Guangyi Chen, and Kun Zhang. Smartclip: Modular vision-language alignment with identifi- cation guarantees. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29780–29790, 2025

2025

[42] [42]

Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection.Advances in Neural Information Processing Systems, 35:9125–9138, 2022

Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection.Advances in Neural Information Processing Systems, 35:9125–9138, 2022

2022

[43] [43]

Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment

Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23497–23506, 2023

2023

[44] [44]

Detclipv3: Towards versatile generative open-vocabulary object de- tection

Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. Detclipv3: Towards versatile generative open-vocabulary object de- tection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27391–27401, 2024

2024

[45] [45]

A zero-shot framework for sketch based image retrieval

Sasi Kiran Yelamarthi, Shiva Krishna Reddy, Ashish Mishra, and Anurag Mittal. A zero-shot framework for sketch based image retrieval. InProceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018

2018

[46] [46]

Phyclip:ℓ 1-product of hyperbolic fac- tors unifies hierarchy and compositionality in vision-language representation learning,

Daiki Yoshikawa and Takashi Matsubara. Phyclip:ℓ 1-product of hyperbolic fac- tors unifies hierarchy and compositionality in vision-language representation learning,

[47] [47]

URLhttps://arxiv.org/abs/2510.08919

arXiv

[48] [48]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descrip- tions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

2014

[49] [49]

When and why vision-language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations,

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations,

[50] [50]

URLhttps://openreview.net/forum?id=KRLUvxh8uaX

[51] [51]

Low-rank few-shot adaptation of vision- language models

Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision- language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1593–1603, 2024

2024

[52] [52]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 11975–11986, 2023

2023

[53] [53]

Long-clip: Unlocking the long-text capability of clip

Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. InEuropean conference on computer vision, pages 310–325. Springer, 2024. 20PREPRINT VERSION:

2024

[54] [54]

Sketchnet: Sketch classification with web images

Hua Zhang, Si Liu, Changqing Zhang, Wenqi Ren, Rui Wang, and Xiaochun Cao. Sketchnet: Sketch classification with web images. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 1105–1113, 2016

2016

[55] [55]

Magiclens: Self-supervised image retrieval with open-ended instructions

Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. InThe Forty-first International Conference on Machine Learning (ICML), page to appear, 2024

2024

[56] [56]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13774–13784, 2024. doi: 10.1109/CVPR52733.2024.01307

work page doi:10.1109/cvpr52733.2024.01307 2024

[57] [57]

Rankclip: Ranking-consistent language-image pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, and Yining Sun. Rankclip: Ranking-consistent language-image pretraining. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3874–3884, October 2025. 6 Ablation Studies (Cont.) 6.1 Impact of Weightα 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 342.5 345.0 34...

2025