PowerCLIP: Powerset Alignment for Contrastive Pre-Training

Hirokatsu Kataoka; Masaki Kawamura; Nakamasa Inoue; Rintaro Yanagi; Rio Yokota

arxiv: 2511.23170 · v5 · submitted 2025-11-28 · 💻 cs.CV

Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota

Pith reviewed 2026-05-17 04:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords: contrastive pre-training · vision-language models · powersets · image region alignment · text parse trees · zero-shot classification · compositionality · non-linear aggregators

The pith

PowerCLIP aligns every subset of image regions with text phrases from parse trees to capture multi-part semantics during contrastive pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops PowerCLIP as an extension of contrastive vision-language training that goes beyond single region-token pairs. It defines a loss that matches the full collection of region subsets against phrases extracted from the text's parse tree. This exhaustive matching is intended to build stronger compositional representations. To avoid the exponential cost of listing all subsets, the authors replace direct powerset operations with non-linear aggregators that keep the cost linear in the number of regions while staying close to the original loss value. Experiments then show gains over prior methods on zero-shot classification and retrieval benchmarks.

Core claim

PowerCLIP minimizes a contrastive loss defined between the powerset of image regions and the parse tree of the accompanying text; non-linear aggregators reduce the cost from exponential to linear in the number of regions while approximating the exact loss to arbitrary precision.

What carries the argument

Powerset alignment between image-region subsets and textual parse-tree phrases, made tractable by non-linear aggregators that replace full enumeration.

If this is right

Zero-shot classification accuracy rises on tasks that require understanding relations among several image parts.
Image-to-text and text-to-image retrieval improve when queries involve compositional descriptions.
The learned representations become more robust to variations in how objects are grouped within scenes.
Training remains practical because the added alignment step scales linearly rather than exponentially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subset-alignment idea could be tested on video or audio where multiple elements must be matched to phrases.
If the approximation works well, it opens the door to hierarchical or tree-structured alignments in other contrastive frameworks.
Performance on fine-grained benchmarks could serve as a practical test of whether the approximated loss retains the key compositional signal.

Load-bearing premise

Non-linear aggregators can approximate the exact powerset loss arbitrarily closely while keeping computation linear in the number of regions.

What would settle it

Compute the exact powerset loss on a toy dataset with few regions and compare it directly to the aggregator output; large divergence would indicate the approximation fails to support the claimed gains.
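A minimal sketch of such a toy check follows. It assumes, purely for illustration, that a subset's score is the sum of per-region scores, so the exact maximum over all subsets equals the sum of the positive scores and a Softplus with sharpness `beta` gives a linear-time surrogate. The function names, the additive score, and `beta` are hypothetical; the paper's actual loss and aggregators may differ.

```python
import itertools
import math


def exact_max_over_subsets(scores):
    """Brute force, O(2^M): best total score over all subsets (empty subset scores 0)."""
    best = 0.0
    for r in range(1, len(scores) + 1):
        for subset in itertools.combinations(scores, r):
            best = max(best, sum(subset))
    return best


def softplus(x, beta):
    """Numerically stable (1/beta) * log(1 + exp(beta * x))."""
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def approx_max_over_subsets(scores, beta):
    """Linear-time surrogate, O(M): sum of Softplus-smoothed positive parts."""
    return sum(softplus(s, beta) for s in scores)


# Hypothetical per-region scores for a single phrase.
scores = [0.8, -0.3, 0.5, -1.2, 0.1]
exact = exact_max_over_subsets(scores)

# Gap between surrogate and exact value for increasing sharpness.
gaps = {beta: approx_max_over_subsets(scores, beta) - exact for beta in (1.0, 10.0, 100.0)}
for beta, gap in gaps.items():
    print(f"beta={beta:g}  gap={gap:.3e}")
```

Under this assumption the gap shrinks as `beta` grows, which is the behaviour the check is meant to probe; a gap that stayed large on the real loss would be the warning sign described above.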

Figures

Figures reproduced from arXiv: 2511.23170 by Hirokatsu Kataoka, Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Rio Yokota.

**Figure 2.** Performance comparison between PowerCLIP and the best-performing method among seven state-of-the-art approaches (CLIP, …

**Figure 3.** Overview of the powerset alignment strategy for PowerCLIP. (a) Region embeddings are extracted for each subset A of region masks in M. (b) Phrase embeddings are extracted for each node B in the parse tree T. (c) Powerset alignment minimizes the triplet loss defined based on the bidirectional similarity: region-set-to-tree (R2T) and vice versa (T2R).

**Figure 4.** Non-Linear Aggregator (NLA). Each layer applies aggregation followed by activation.

Adjacent text from the paper: T2R Aggregation. Conversely, this aggregation computes the best-matching region subset for each phrase. We define the T2R similarity matrix $Q^{\leftarrow} \in \mathbb{R}^{C \times C}$ as

$$Q^{\leftarrow}_{i,j} = \frac{1}{|T_j|} \sum_{B \in T_j} \max_{A \subseteq M_i} Q_{i,j,A,B}. \tag{7}$$

This emphasizes phrase-level grounding by ensuring each phrase is closely matched to a region subset.
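For very small inputs, the T2R definition in Eq. (7) can be evaluated by direct enumeration. The sketch below is only an illustration: the similarity function `sim`, the toy `affinity` table, and the choice to enumerate non-empty subsets only are assumptions, not details taken from the paper.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple (exponential in len(items))."""
    items = list(items)
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)


def t2r_similarity(masks, trees, sim):
    """Brute-force Eq. (7): for each (image i, text j), average over phrases B in T_j
    of the best similarity over region subsets A of M_i."""
    out = [[0.0] * len(trees) for _ in masks]
    for i, M_i in enumerate(masks):
        for j, T_j in enumerate(trees):
            total = 0.0
            for B in T_j:
                total += max(sim(i, j, A, B) for A in nonempty_subsets(M_i))
            out[i][j] = total / len(T_j)
    return out


# Toy data (hypothetical): region ids per image and phrase ids per text.
masks = [[0, 1], [2]]
trees = [[0, 1], [2]]
affinity = {
    (0, 0): 0.9, (0, 1): 0.1, (0, 2): 0.0,
    (1, 0): 0.2, (1, 1): 0.7, (1, 2): 0.1,
    (2, 0): 0.0, (2, 1): 0.3, (2, 2): 0.8,
}


def sim(i, j, A, B):
    # Assumed stand-in for Q_{i,j,A,B}: mean region-to-phrase affinity over the subset.
    return sum(affinity[(a, B)] for a in A) / len(A)


Q_left = t2r_similarity(masks, trees, sim)
print(Q_left)
```

The inner `max` runs over every subset, which is exactly the exponential step the non-linear aggregators are meant to avoid.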

**Figure 5.** Approximation accuracy evaluation. Top: Comparison between exact and approximated losses for …

**Figure 6.** Visualizations of text-to-patch similarities. For each input text, we compute similarities between the text representation and …

**Figure 7.** Approximation accuracy evaluation for NLA-T1 and NLA-T2.

**Figure 9.** Per-epoch training time vs. number of masks K with and without approximation. Without approximation, runs with K>7 fail due to OOM.

| Method | Train time (s) | Rel. to CLIP |
|---|---|---|
| CLIP [49] | 1378 | 1.00× |
| SPARC [3] | 1730 | 1.26× |
| FILIP [17] | 1947 | 1.41× |
| PowerCLIP | 2366 | 1.72× |
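The "Rel. to CLIP" values quoted with Figure 9 appear to be each method's training time divided by CLIP's. A short check of that arithmetic, using only the numbers quoted above (the variable names are illustrative):

```python
# Training times in seconds, as quoted alongside Figure 9.
train_time_s = {"CLIP": 1378, "SPARC": 1730, "FILIP": 1947, "PowerCLIP": 2366}

# Ratio of each method's time to the CLIP baseline, rounded to two decimals.
relative = {name: round(t / train_time_s["CLIP"], 2) for name, t in train_time_s.items()}
print(relative)
```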

**Figure 10.** Qualitative comparison of text-to-patch similarity heatmaps across different models.

**Figure 11.** Qualitative examples of compositional reasoning.

Original abstract

Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Code is available at https://github.com/Masakichi210/PowerCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PowerCLIP, a contrastive vision-language pre-training method that augments CLIP with powerset alignment: it minimizes a loss over all subsets of image regions matched against phrases from textual parse trees to capture multi-region compositional semantics. To avoid the O(2^M) cost of explicit powerset enumeration, the authors introduce non-linear aggregators (NLAs) claimed to reduce complexity to O(M) while approximating the exact powerset loss to arbitrary precision. Experiments are reported to show improved zero-shot classification and retrieval performance over prior methods.

Significance. If the NLA approximation faithfully preserves the higher-order subset interactions of the exact powerset objective and the empirical gains are reproducible with proper controls, the work could advance fine-grained compositional alignment in VL models. The public code release supports reproducibility.

major comments (2)

[Non-linear aggregators (NLAs) section] The central claim that powerset alignment drives improved compositionality rests on NLAs approximating the exact loss with arbitrary precision. No formal error bound, convergence analysis, or empirical measurement of approximation error (e.g., difference between NLA and brute-force powerset loss on small M) is provided to confirm that subset-interaction terms are preserved; without this, the optimized objective may diverge from the stated powerset construction.
[Abstract and Experiments] The abstract states that extensive experiments demonstrate outperformance on zero-shot tasks, yet no quantitative results, error bars, dataset details, or ablations (e.g., full powerset vs. NLA, or NLA error vs. downstream gains) are supplied. This leaves the load-bearing claim that the powerset mechanism (rather than the NLA heuristic) produces the reported benefits unverified.

minor comments (2)

[Method overview] Clarify early how textual parse trees are obtained and how region proposals are generated, including any hyperparameters that affect M.
[NLAs definition] The claim of 'arbitrary precision' approximation should be accompanied by a concrete statement of the approximation scheme (e.g., which non-linear functions are used and under what conditions the error vanishes).
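One concrete form such a statement could take is sketched below. It is an illustration under an assumed additive subset score (per-region scores $s_1,\dots,s_M$, with the empty subset scoring 0), which is a hypothetical simplification and not a restatement of the paper's theorems.

```latex
\[
\max_{A \subseteq \{1,\dots,M\}} \sum_{i \in A} s_i \;=\; \sum_{i=1}^{M} \max(s_i, 0),
\qquad
\operatorname{softplus}_\beta(x) \;=\; \tfrac{1}{\beta}\log\bigl(1 + e^{\beta x}\bigr),
\]
\[
0 \;<\; \operatorname{softplus}_\beta(x) - \max(x, 0)
\;=\; \tfrac{1}{\beta}\log\bigl(1 + e^{-\beta |x|}\bigr)
\;\le\; \tfrac{\log 2}{\beta}
\quad\Longrightarrow\quad
0 \;<\; \sum_{i=1}^{M} \operatorname{softplus}_\beta(s_i) - \max_{A} \sum_{i \in A} s_i
\;\le\; \tfrac{M \log 2}{\beta}.
\]
```

In this simplified setting the error vanishes as $\beta \to \infty$ at fixed $M$, which is the kind of explicit condition the comment asks for.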

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, acknowledging where the manuscript can be strengthened through revisions while defending the core technical contributions on substantive grounds.

Referee: [Non-linear aggregators (NLAs) section] The central claim that powerset alignment drives improved compositionality rests on NLAs approximating the exact loss with arbitrary precision. No formal error bound, convergence analysis, or empirical measurement of approximation error (e.g., difference between NLA and brute-force powerset loss on small M) is provided to confirm that subset-interaction terms are preserved; without this, the optimized objective may diverge from the stated powerset construction.

Authors: We agree that the manuscript would benefit from explicit validation of the NLA approximation. The NLAs are constructed to preserve higher-order subset interactions via non-linear pooling that approximates the combinatorial sum in the powerset loss; however, the current version does not include formal error bounds or direct empirical comparisons to brute-force enumeration. In the revised manuscript we will add a new subsection under the NLA description that derives a bound on the approximation error under Lipschitz assumptions on the aggregator functions and reports empirical loss differences for small M (M ≤ 5) on held-out image-text pairs, confirming that the dominant interaction terms are retained. revision: yes
Referee: [Abstract and Experiments] The abstract states that extensive experiments demonstrate outperformance on zero-shot tasks, yet no quantitative results, error bars, dataset details, or ablations (e.g., full powerset vs. NLA, or NLA error vs. downstream gains) are supplied. This leaves the load-bearing claim that the powerset mechanism (rather than the NLA heuristic) produces the reported benefits unverified.

Authors: Abstracts are intentionally concise and do not contain numerical results or error bars; those appear in the Experiments section. We nevertheless recognize that additional controls are needed to isolate the powerset contribution. The revised manuscript will expand the Experiments section with (i) full quantitative tables including standard deviations over multiple seeds, (ii) explicit dataset and hyper-parameter details, and (iii) new ablations that compare NLA against exact powerset loss (feasible for small M) and plot downstream gains against measured approximation error, thereby verifying that performance improvements track the powerset objective rather than the aggregator implementation alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The derivation begins from the standard CLIP contrastive objective and defines a new powerset alignment loss directly over region subsets and parse-tree phrases. The non-linear aggregators are introduced as a computational reduction that approximates this loss, with the overall framework validated through independent zero-shot classification and retrieval experiments rather than any self-referential fit, redefinition, or load-bearing self-citation chain. No equation or claim reduces the reported gains to a parameter fitted from the target data or to a prior result whose justification collapses back into the current paper. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the introduced powerset alignment captures compositional semantics better than single-region baselines and that the NLA approximation preserves the essential loss signal; no free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption: Standard contrastive loss framework from CLIP-style models
The paper extends an existing pre-training paradigm rather than deriving a new objective from first principles.

invented entities (1)

Non-linear aggregators (NLAs) · no independent evidence
purpose: Efficient approximation of the exact powerset loss
New computational construct introduced to avoid exponential cost while claiming arbitrary-precision approximation.

pith-pipeline@v0.9.0 · 5514 in / 1271 out tokens · 31046 ms · 2026-05-17T04:54:41.570678+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

NLAs ... reduce complexity from O(2^M) to O(M) ... approximating the exact loss value with arbitrary precision (Theorems 1 and 2).
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

NLA-T1 with Softplus approximates T2R max aggregation; NLA-T2 with tanh interpolates R2T bounds.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 3 internal anchors

[1]

Finelip: Extending clip’s reach via fine-grained alignment with longer text inputs

Mothilal Asokan, Kebin Wu, and Fatima Albreiki. Finelip: Extending clip’s reach via fine-grained alignment with longer text inputs. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

work page 2025
[2]

Learning local feature descriptors with triplets and shallow convolutional neural networks

Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proc. British Machine Vision Conference (BMVC), 2016.

work page 2016
[3]

Improving fine-grained understanding in image-text pre-training

Ioana Bica, Anastasija Ilić, Matthias Bauer, Göker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, and Jovana Mitrović. Improving fine-grained understanding in image-text pre-training. In Proc. International Conference on Machine Learning (ICML), pages 3974–3995, 2024.

work page 2024
[4]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Proc. European Conference on Computer Vision (ECCV), pages 446–461, 2014.

work page 2014
[5]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3558–3568, 2021.

work page 2021
[6]

Remote sensing image scene classification: Benchmark and state of the art

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.

work page 2017
[7]

Goal: Global-local object alignment learning

Hyungyu Choi, Young Kyun Jang, and Chanho Eom. Goal: Global-local object alignment learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4070–4079, 2025.

work page 2025
[8]

Fine-grained image-text correspondence with cost aggregation for open-vocabulary part segmentation

Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, and Hyunjung Shim. Fine-grained image-text correspondence with cost aggregation for open-vocabulary part segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9782–9793, 2025.

work page 2025
[9]

Meta CLIP 2: A worldwide scaling recipe

Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, and Hu Xu. Meta CLIP 2: A worldwide scaling recipe. arXiv preprint arXiv:2507.22062, pages 1–10, 2025.

work page arXiv 2025
[10]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3606–3613,

work page
[11]

An analysis of single-layer networks in unsupervised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.

work page 2011
[12]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

work page 2009
[13]

MaskCLIP: Masked self-distillation advances contrastive language-image pretraining

Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. MaskCLIP: Masked self-distillation advances contrastive language-image pretraining. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10995–11005, 2023.

work page 2023
[14]

An image is worth 16×16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In Proc. International Conference on Learning Representations (ICLR), 2021.

work page 2021
[15]

DIH-CLIP: Unleashing the diversity of Multi-Head Self-Attention for Training-Free Open-Vocabulary semantic segmentation

Songsong Duan, Xi Yang, and Nannan Wang. DIH-CLIP: Unleashing the diversity of Multi-Head Self-Attention for Training-Free Open-Vocabulary semantic segmentation. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 22794–22803, 2025.

work page 2025
[16]

Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos E. Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations. In Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.

work page 2024
[17]

Filip: Fine-grained interactive language-image pre-training

Lewei Yao et al. Filip: Fine-grained interactive language-image pre-training. In Proc. International Conference on Learning Representations (ICLR), 2022.

work page 2022
[18]

The pascal visual object classes (voc) challenge

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV), 88:303–338, 2010.

work page 2010
[19]

Improving clip training with language rewrites

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. In Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 35544–35575, 2023.

work page 2023
[20]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 178–178,

work page
[21]

Linguistic-aware patch slimming framework for fine-grained cross-modal alignment

Zheren Fu, Lei Zhang, Hou Xia, and Zhendong Mao. Linguistic-aware patch slimming framework for fine-grained cross-modal alignment. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26297–26306, 2024.

work page 2024
[22]

Clip-adapted region-to-text learning for generative open-vocabulary semantic segmentation

Jiannan Ge, Lingxi Xie, Hongtao Xie, Pandeng Li, Sun-Ao Liu, Xiaopeng Zhang, Qi Tian, and Yongdong Zhang. Clip-adapted region-to-text learning for generative open-vocabulary semantic segmentation. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 24034–24044, 2025.

work page 2025
[23]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.

work page 2022
[24]

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.

work page 2019
[25]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV),...

work page 2021
[26]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15262–15271, 2021.

work page 2021
[27]

SugarCrepe: Fixing hackable benchmarks for vision-language compositionality

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. In Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 31096–31116, 2023.

work page 2023
[28]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proc. International Conference on Machine Learning (ICML), pages 4904–4916, 2021.

work page 2021
[29]

FineCLIP: Self-distilled region-based CLIP for better fine-grained understanding

Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, and Zhiwu Lu. FineCLIP: Self-distilled region-based CLIP for better fine-grained understanding. In Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 27896–27918,

work page
[30]

DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment

Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, and Piotr Bojanowski. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In Proc. IEEE/CVF Conference on Comput...

work page 2025
[31]

Is CLIP ideal? No. Can we fix it? Yes!

Raphi Kang, Yue Song, Georgia Gkioxari, and Pietro Perona. Is CLIP ideal? No. Can we fix it? Yes! In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 22436–22446, 2025.

work page 2025
[32]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proc. IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.

work page 2013
[33]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, University of Toronto,

work page
[34]

VeCLIP: Improving clip training via visual-enriched captions

Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. VeCLIP: Improving clip training via visual-enriched captions. In Proc. European Conference on Computer Vision (ECCV), pages 111–127, 2024.

work page 2024
[35]

Scaling language-image pre-training via masking

Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23390–23400, 2023.

work page 2023
[36]

Grounded language-image pre-training

Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

work page 2022
[37]

Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

Yongkang Li, Tianheng Cheng, Bin Feng, Wenyu Liu, and Xinggang Wang. Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14998–15008, 2025.

work page 2025
[38]

Unbiased Region-Language alignment for Open-Vocabulary dense prediction

Yunheng Li, Yuxuan Li, Quan-Sheng Zeng, Wenhai Wang, Qibin Hou, and Ming-Ming Cheng. Unbiased Region-Language alignment for Open-Vocabulary dense prediction. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 23795–23805, 2025.

work page 2025
[39]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Proc. European Conference on Computer Vision (ECCV), pages 740–755, 2014.

work page 2014
[40]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

work page internal anchor Pith review Pith/arXiv arXiv 2013
[41]

SLIP: Self-supervision meets language-image pre-training

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets language-image pre-training. In Proc. European Conference on Computer Vision (ECCV), pages 529–544, 2022.

work page 2022
[42]

Open vocabulary semantic segmentation with patch aligned contrastive learning

Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip H.S. Torr, and Ser-Nam Lim. Open vocabulary semantic segmentation with patch aligned contrastive learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

work page 2023
[43]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Proc. Indian Conference on Computer Vision and Graphics & Image Processing, pages 722–729, 2008.

work page 2008
[44]

Know “No” better: A data-driven approach for enhancing negation awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know “No” better: A data-driven approach for enhancing negation awareness in CLIP. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, 2025.

work page 2025
[45]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3498–3505, 2012.

work page 2012
[46]

TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives

Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives. In Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 32731–32760, 2024.

work page 2024
[47]

Seeing what matters: Empowering CLIP with patch generation-to-selection

Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, and Yazhou Yao. Seeing what matters: Empowering CLIP with patch generation-to-selection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24862–24872, 2025.

work page 2025
[48]

Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmentation

Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaoming Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15009–15020, 2025.

work page 2025
[49]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proc. International Conference on Machine Learning (ICML), pages 8748–8763, 2021. 1, 2, 6, 7, 8, 4, 5
[50]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. In Proc...

work page 2025
[51]

Do imagenet classifiers generalize to imagenet?

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In Proc. International Conference on Machine Learning (ICML), pages 5389–5400, 2019. 6
[52]

The german traffic sign recognition benchmark: a multi-class classification competition

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In IJCNN, pages 1453–1460, 2011. 6
[53]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023. 2
[54]

Alpha-CLIP: A CLIP model focusing on wherever you want

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-CLIP: A CLIP model focusing on wherever you want. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13019–13029, 2024. 2
[55]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238, 2022. 2, 6
[56]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense fe...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Rotation equivariant cnns for digital pathology

Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology. In MICCAI, pages 210–218, 2018. 6
[58]

Fix-clip: Dual-branch hierarchical contrastive learning via synthetic captions for better understanding of long text

Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, and Wei Liu. Fix-clip: Dual-branch hierarchical contrastive learning via synthetic captions for better understanding of long text. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 20694–20704, 2025. 2
[59]

Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary C. Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In Proc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 13754–13764, 2019. 6
[60]

Efficient vision-language pre-training by cluster masking

Zihao Wei, Zixuan Pan, and Andrew Owens. Efficient vision-language pre-training by cluster masking. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26815–26825, 2024. 1, 2, 6, 7, 8, 4
[61]

MaskFeat: Masked feature prediction for self-supervised visual pre-training

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. MaskFeat: Masked feature prediction for self-supervised visual pre-training. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14668–14678, 2022. 2
[62]

Hq-clip: Leveraging large Vision-Language models to create high-quality image-text datasets and CLIP models

Zhixiang Wei, Guangting Wang, Xiaoxiao Ma, Ke Mei, Huaian Chen, Yi Jin, and Fengyun Rao. Hq-clip: Leveraging large Vision-Language models to create high-quality image-text datasets and CLIP models. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 22447–22456, 2025. 2
[63]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010. 6
[64]

Fg-clip: Fine-grained visual and textual alignment

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. Fg-clip: Fine-grained visual and textual alignment. In Proc. International Conference on Machine Learning (ICML), 2025. 2
[65]

SimMIM: A simple framework for masked image modeling

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9643–9653, 2022. 2
[66]

Demystifying CLIP data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. In Proc. International Conference on Learning Representations (ICLR), 2024. 2
[67]

Attentive mask clip

Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, et al. Attentive mask clip. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2771–2781, 2023. 1, 2, 6, 7, 8, 4
[68]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 6
[69]

CoCa: Contrastive captioners are image-text foundation models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research (TMLR), pages 1–20, 2022. 2
[70]

When and why vision-language models behave like bags-of-words, and what to do about it?

Mert Yüksekgönül, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In Proc. International Conference on Learning Representations (ICLR), 2023. 2
[71]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 2
[72]

Long-CLIP: Unlocking the long-text capability of CLIP

Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. In Proc. European Conference on Computer Vision (ECCV), pages 310–325, 2024. 2
[73]

Corrclip: Reconstructing patch correlations in CLIP for open-vocabulary semantic segmentation

Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Reconstructing patch correlations in CLIP for open-vocabulary semantic segmentation. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 24677–24687,
[74]

RegionCLIP: Region-based Language-Image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. RegionCLIP: Region-based Language-Image pretraining. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16793–16803, 2022. 2
[75]

Extract free dense labels from CLIP

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. In Proc. European Conference on Computer Vision (ECCV), pages 696–712, 2022. 2

PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Supplementary Material

Appendix A. Proof of Theorem 1

In this section, we present a proof of Theorem 1. We first restate the definitions of th...

[44] [44]

Know “No” better: A data- driven approach for enhancing negation awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know “No” better: A data- driven approach for enhancing negation awareness in CLIP. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, 2025. 2

work page 2025

[45] [45]

Cats and dogs

Parkhi, Omkar M, Vedaldi, Andrea, Zisserman, Andrew, Jawahar, and CV . Cats and dogs. InProc. IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 3498–3505, 2012. 6

work page 2012

[46] [46]

TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives

Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives. InProc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 32731–32760, 2024. 1, 2

work page 2024

[47] [47]

Seeing what matters: Empowering CLIP with patch generation-to-selection

Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, and Yazhou Yao. Seeing what matters: Empowering CLIP with patch generation-to-selection. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 24862–24872, 2025. 1, 2, 6, 7, 8, 4

work page 2025

[48] [48]

Parameter-efficient fine-tuning in hyperspherical space for open-vocabularysemantic segmen- tation

Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaom- ing Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabularysemantic segmen- tation. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15009–15020, 2025. 2

work page 2025

[49] [49]

Learn- ing transferable visual models from natural language super- vision

Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, et al. Learn- ing transferable visual models from natural language super- vision. InProc. International Conference on Machine Learn- ing (ICML), pages 8748–8763, 2021. 1, 2, 6, 7, 8, 4, 5

work page 2021

[50] [50]

Sam 2: Segment anything in images and videos

Ravi, Nikhila, Gabeur, Valentin, Hu, Yuan-Ting, Hu, Rong- hang, Ryali, Chaitanya, Ma, Tengyu, Khedr, Haitham, R¨adle, Roman, Rolland, Chloe, Gustafson, Laura, Mintun, Eric, Pan, Junting, Alwala, Kalyan Vasudev, Carion, Nicolas, Wu, Chao-Yuan, Girshick, Ross, Doll´ar, Piotr, Feichtenhofer, and Christoph. Sam 2: Segment anything in images and videos. InProc...

work page 2025

[51] [51]

Do imagenet classifiers generalize to ima- genet? InProc

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to ima- genet? InProc. International Conference on Machine Learn- ing (ICML), pages 5389–5400, 2019. 6

work page 2019

[52] [52]

The german traffic sign recognition bench- mark: a multi-class classification competition

Stallkamp, Johannes, Schlipsing, Marc, Salmen, Jan, Igel, and Christian. The german traffic sign recognition bench- mark: a multi-class classification competition. InIJCNN, pages 1453–1460, 2011. 6

work page 2011

[53] [53]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EV A-CLIP: Improved training techniques for CLIP at scale.arXiv preprint arXiv:2303.15389, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [54]

Alpha- CLIP: A CLIP model focusing on wherever you want

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha- CLIP: A CLIP model focusing on wherever you want. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 13019–13029, 2024. 2

work page 2024

[55] [55]

Winoground: Probing vision and language models for visio- linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio- linguistic compositionality. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 5228–5238, 2022. 2, 6

work page 2022

[56] [56]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H’enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision- language encoders with improved semantic understanding and localization and and dense fe...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [57]

Rotation equivariant cnns for digital pathology

Veeling, Bastiaan S, Linmans, Jasper, Winkens, Jim, Cohen, Taco, Welling, and Max. Rotation equivariant cnns for digital pathology. InMICCAI, pages 210–218, 2018. 6

work page 2018

[58] [58]

Fix- clip: Dual-branch hierarchical contrastive learning via syn- thetic captions for better understanding of long text

Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, and Wei Liu. Fix- clip: Dual-branch hierarchical contrastive learning via syn- thetic captions for better understanding of long text. In Proc. IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 20694–20704, 2025. 2

work page 2025

[59] [59]

Lipton, and Eric P

Haohan Wang, Songwei Ge, Zachary C. Lipton, and Eric P. Xing. Learning robust global representations by penaliz- ing local predictive power. InProc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 13754–13764, 2019. 6

work page 2019

[60] [60]

Efficient vision-language pre-training by cluster masking

Wei, Zihao, Pan, Zixuan, Owens, and Andrew. Efficient vision-language pre-training by cluster masking. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26815–26825, 2024. 1, 2, 6, 7, 8, 4

work page 2024

[61] [61]

MaskFeat: Masked feature prediction for self-supervised visual pre-training

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. MaskFeat: Masked feature prediction for self-supervised visual pre-training. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 14668–14678, 2022. 2

work page 2022

[62] [62]

Hq-clip: Leveraging large Vision-Language models to create high-quality image- text datasets and CLIP models

Zhixiang Wei, Guangting Wang, Xiaoxiao Ma, Ke Mei, Hua- ian Chen, Yi Jin, and Fengyun Rao. Hq-clip: Leveraging large Vision-Language models to create high-quality image- text datasets and CLIP models. InProc. IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 22447–22456, 2025. 2

work page 2025

[63] [63]

Sun database: Large-scale scene recognition from abbey to zoo

Xiao, Jianxiong, Hays, James, Ehinger, Krista A, Oliva, Aude, Torralba, and Antonio. Sun database: Large-scale scene recognition from abbey to zoo. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010. 6

work page 2010

[64] Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. Fg-clip: Fine-grained visual and textual alignment. In Proc. International Conference on Machine Learning (ICML), 2025. 2

[65] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9643–9653, 2022. 2

[66] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. In Proc. International Conference on Learning Representations (ICLR), 2024. 2

[67] Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, et al. Attentive mask clip. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2771–2781, 2023. 1, 2, 6, 7, 8, 4

[68] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 6

[69] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research (TMLR), pages 1–20, 2022. 2

[70] Mert Yüksekgönül, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In Proc. International Conference on Learning Representations (ICLR), 2023. 2

[71] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 2

[72] Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. In Proc. European Conference on Computer Vision (ECCV), pages 310–325, 2024. 2

[73] Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Reconstructing patch correlations in CLIP for open-vocabulary semantic segmentation. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 24677–24687,

[74] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. RegionCLIP: Region-based Language-Image pretraining. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16793–16803, 2022. 2

[75] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. In Proc. European Conference on Computer Vision (ECCV), pages 696–712, 2022. 2

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

Supplementary Material

Appendix A. Proof of Theorem 1

In this section, we present a proof of Theorem 1. We first restate the definitions of th...