PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Pith reviewed 2026-05-17 04:54 UTC · model grok-4.3
The pith
PowerCLIP aligns every subset of image regions with text phrases from parse trees to capture multi-part semantics during contrastive pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PowerCLIP minimizes a contrastive loss defined between the powerset of image regions and the parse tree of the accompanying text; non-linear aggregators reduce the cost from exponential to linear in the number of regions while preserving arbitrary approximation accuracy to the exact loss.
What carries the argument
Powerset alignment between image-region subsets and textual parse-tree phrases, made tractable by non-linear aggregators that replace full enumeration.
If this is right
- Zero-shot classification accuracy rises on tasks that require understanding relations among several image parts.
- Image-to-text and text-to-image retrieval improve when queries involve compositional descriptions.
- The learned representations become more robust to variations in how objects are grouped within scenes.
- Training remains practical because the added alignment step scales linearly rather than exponentially.
Where Pith is reading between the lines
- The same subset-alignment idea could be tested on video or audio where multiple elements must be matched to phrases.
- If the approximation works well, it opens the door to hierarchical or tree-structured alignments in other contrastive frameworks.
- Performance on fine-grained benchmarks could serve as a practical test of whether the approximated loss retains the key compositional signal.
Load-bearing premise
Non-linear aggregators can approximate the exact powerset loss arbitrarily closely while keeping computation linear in the number of regions.
What would settle it
Compute the exact powerset loss on a toy dataset with few regions and compare it directly to the aggregator output; large divergence would indicate the approximation fails to support the claimed gains.
Figures
read the original abstract
Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Code is available at https://github.com/Masakichi210/PowerCLIP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PowerCLIP, a contrastive vision-language pre-training method that augments CLIP with powerset alignment: it minimizes a loss over all subsets of image regions matched against phrases from textual parse trees to capture multi-region compositional semantics. To avoid the O(2^M) cost of explicit powerset enumeration, the authors introduce non-linear aggregators (NLAs) claimed to reduce complexity to O(M) while approximating the exact powerset loss to arbitrary precision. Experiments are reported to show improved zero-shot classification and retrieval performance over prior methods.
Significance. If the NLA approximation faithfully preserves the higher-order subset interactions of the exact powerset objective and the empirical gains are reproducible with proper controls, the work could advance fine-grained compositional alignment in VL models. The public code release supports reproducibility.
major comments (2)
- [Non-linear aggregators (NLAs) section] The central claim that powerset alignment drives improved compositionality rests on NLAs approximating the exact loss with arbitrary precision. No formal error bound, convergence analysis, or empirical measurement of approximation error (e.g., difference between NLA and brute-force powerset loss on small M) is provided to confirm that subset-interaction terms are preserved; without this, the optimized objective may diverge from the stated powerset construction.
- [Abstract and Experiments] The abstract states that extensive experiments demonstrate outperformance on zero-shot tasks, yet no quantitative results, error bars, dataset details, or ablations (e.g., full powerset vs. NLA, or NLA error vs. downstream gains) are supplied. This leaves the load-bearing claim that the powerset mechanism (rather than the NLA heuristic) produces the reported benefits unverified.
minor comments (2)
- [Method overview] Clarify early how textual parse trees are obtained and how region proposals are generated, including any hyperparameters that affect M.
- [NLAs definition] The claim of 'arbitrary precision' approximation should be accompanied by a concrete statement of the approximation scheme (e.g., which non-linear functions are used and under what conditions the error vanishes).
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, acknowledging where the manuscript can be strengthened through revisions while defending the core technical contributions on substantive grounds.
read point-by-point responses
-
Referee: [Non-linear aggregators (NLAs) section] The central claim that powerset alignment drives improved compositionality rests on NLAs approximating the exact loss with arbitrary precision. No formal error bound, convergence analysis, or empirical measurement of approximation error (e.g., difference between NLA and brute-force powerset loss on small M) is provided to confirm that subset-interaction terms are preserved; without this, the optimized objective may diverge from the stated powerset construction.
Authors: We agree that the manuscript would benefit from explicit validation of the NLA approximation. The NLAs are constructed to preserve higher-order subset interactions via non-linear pooling that approximates the combinatorial sum in the powerset loss; however, the current version does not include formal error bounds or direct empirical comparisons to brute-force enumeration. In the revised manuscript we will add a new subsection under the NLA description that derives a bound on the approximation error under Lipschitz assumptions on the aggregator functions and reports empirical loss differences for small M (M ≤ 5) on held-out image-text pairs, confirming that the dominant interaction terms are retained. revision: yes
-
Referee: [Abstract and Experiments] The abstract states that extensive experiments demonstrate outperformance on zero-shot tasks, yet no quantitative results, error bars, dataset details, or ablations (e.g., full powerset vs. NLA, or NLA error vs. downstream gains) are supplied. This leaves the load-bearing claim that the powerset mechanism (rather than the NLA heuristic) produces the reported benefits unverified.
Authors: Abstracts are intentionally concise and do not contain numerical results or error bars; those appear in the Experiments section. We nevertheless recognize that additional controls are needed to isolate the powerset contribution. The revised manuscript will expand the Experiments section with (i) full quantitative tables including standard deviations over multiple seeds, (ii) explicit dataset and hyper-parameter details, and (iii) new ablations that compare NLA against exact powerset loss (feasible for small M) and plot downstream gains against measured approximation error, thereby verifying that performance improvements track the powerset objective rather than the aggregator implementation alone. revision: yes
Circularity Check
No significant circularity detected
full rationale
The derivation begins from the standard CLIP contrastive objective and defines a new powerset alignment loss directly over region subsets and parse-tree phrases. The non-linear aggregators are introduced as a computational reduction that approximates this loss, with the overall framework validated through independent zero-shot classification and retrieval experiments rather than any self-referential fit, redefinition, or load-bearing self-citation chain. No equation or claim reduces the reported gains to a parameter fitted from the target data or to a prior result whose justification collapses back into the current paper. The approach remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard contrastive loss framework from CLIP-style models
invented entities (1)
-
Non-linear aggregators (NLAs)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
NLAs ... reduce complexity from O(2^M) to O(M) ... approximating the exact loss value with arbitrary precision (Theorems 1 and 2).
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leancostAlphaLog_high_calibrated_iff unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
NLA-T1 with Softplus approximates T2R max aggregation; NLA-T2 with tanh interpolates R2T bounds.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Finelip: Extending clip’s reach via fine-grained alignment with longer text inputs
Mothilal Asokan, Kebin Wu, and Fatima Albreiki. Finelip: Extending clip’s reach via fine-grained alignment with longer text inputs. InProc. IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025. 1, 2, 4
work page 2025
-
[2]
Learning local feature descriptors with triplets and shallow convolutional neural networks
Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. InProc. British Machine Vision Conference (BMVC), 2016. 4
work page 2016
-
[3]
Grit- senko, Matthias Minderer, Charles Blundell, Razvan Pas- canu, and Jovana Mitrovi’c
Ioana Bica, Anastasija Ili’c, Matthias Bauer, G”oker Erdo- gan, Matko Bo ˇsnjak, Christos Kaplanis, Alexey A. Grit- senko, Matthias Minderer, Charles Blundell, Razvan Pas- canu, and Jovana Mitrovi’c. Improving fine-grained under- standing in image-text pre-training. InProc. International Conference on Machine Learning (ICML), pages 3974– 3995, 2024. 1, 2,...
work page 2024
-
[4]
Food-101–mining discriminative components with random forests
Bossard, Lukas, Guillaumin, Matthieu, Van Gool, and Luc. Food-101–mining discriminative components with random forests. InProc. European Conference on Computer Vision (ECCV), pages 446–461, 2014. 6
work page 2014
-
[5]
Conceptual 12m: Pushing web-scale image- text pre-training to recognize long-tail visual concepts
Changpinyo, Soravit, Sharma, Piyush, Ding, Nan, Soricut, and Radu. Conceptual 12m: Pushing web-scale image- text pre-training to recognize long-tail visual concepts. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 3558–3568, 2021. 6
work page 2021
-
[6]
Cheng, Gong, Han, Junwei, Lu, and Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017. 6
work page 2017
-
[7]
Goal: Global-local object alignment learning
Choi, Hyungyu, Jang, Young Kyun, Eom, and Chanho. Goal: Global-local object alignment learning. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4070–4079, 2025. 1, 2
work page 2025
-
[8]
Fine-grained image-text correspondence with cost aggregation for open-vocabulary part segmenta- tion
Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, and Hyunjung Shim. Fine-grained image-text correspondence with cost aggregation for open-vocabulary part segmenta- tion. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9782–9793, 2025. 2
work page 2025
-
[9]
Meta clip 2: A worldwide scaling recipe
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen tau Yih, Shang-Wen Li, and Hu Xu. Meta CLIP 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062, pages 1–10, 2025. 2
-
[10]
Describing textures in the wild
Cimpoi, Mircea, Maji, Subhransu, Kokkinos, Iasonas, Mo- hamed, Sammy, Vedaldi, and Andrea. Describing textures in the wild. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3606–3613,
-
[11]
An analysis of single-layer networks in unsupervised feature learning
Coates, Adam, Ng, Andrew, Lee, and Honglak. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011. 6
work page 2011
-
[12]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. InProc. IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 248–255, 2009. 6
work page 2009
-
[13]
MaskCLIP: Masked self-distillation advances contrastive language-image pretraining
Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. MaskCLIP: Masked self-distillation advances contrastive language-image pretraining. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10995–11005, 2023. 2
work page 2023
-
[14]
An image is worth 16×16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. InProc. International Conference on Learning Repre- sentations (ICLR), 2021. 6
work page 2021
-
[15]
Songsong Duan, Xi Yang, and Nannan Wang. DIH-CLIP: Unleashing the diversity of Multi-Head Self-Attention for Training-Free Open-V ocabulary semantic segmentation. In Proc. IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 22794–22803, 2025. 2
work page 2025
-
[16]
Milios, Sageev Oore, and Hassan Saj- jad
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos E. Milios, Sageev Oore, and Hassan Saj- jad. Sugarcrepe++dataset: Vision-language model sen- sitivity to semantic and lexical alterations. InProc. An- nual Conference on Neural Information Processing Systems (NeurIPS), 2024. 2
work page 2024
-
[17]
Filip: Fine-grained interactive language- image pre-training
Lewei Yao et al. Filip: Fine-grained interactive language- image pre-training. InProc. International Conference on Learning Representations (ICLR), 2022. 1, 2, 4, 6, 7, 8, 5
work page 2022
-
[18]
Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, and Andrew. The pascal visual object classes (voc) challenge.International Journal of Com- puter Vision (IJCV), 88:303–338, 2010. 6
work page 2010
-
[19]
Improving clip training with language rewrites
Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. InProc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 35544–35575, 2023. 2
work page 2023
-
[20]
Fei-Fei, Li, Fergus, Rob, Perona, and Pietro. Learning gen- erative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 178–178,
-
[21]
Linguistic-aware patch slimming framework for fine-grained cross-modal alignment
Zheren Fu, Lei Zhang, Hou Xia, and Zhendong Mao. Linguistic-aware patch slimming framework for fine-grained cross-modal alignment. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26297–26306, 2024. 2
work page 2024
-
[22]
Clip-adapted region-to-text learning for generative open- vocabulary semantic segmentation
Jiannan Ge, Lingxi Xie, Hongtao Xie, Pandeng Li, Sun- Ao Liu, Xiaopeng Zhang, Qi Tian, and Yongdong Zhang. Clip-adapted region-to-text learning for generative open- vocabulary semantic segmentation. InProc. IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 24034–24044, 2025. 2
work page 2025
-
[23]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProc. IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 16000– 16009, 2022. 2
work page 2022
-
[24]
Helber, Patrick, Bischke, Benjamin, Dengel, Andreas, Borth, and Damian. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 6
work page 2019
-
[25]
The many faces of robust- ness: A critical analysis of out-of-distribution generalization
Hendrycks, Dan, Basart, Steven, Mu, Norman, Kadavath, Saurav, Wang, Frank, Dorundo, Evan, Desai, Rahul, Zhu, Tyler, Parajuli, Samyak, Guo, Mike, Song, Dawn, Stein- hardt, Jacob, Gilmer, and Justin. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. InProc. IEEE/CVF International Conference on Computer Vision (ICCV),...
work page 2021
-
[26]
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Stein- hardt, and Dawn Song. Natural adversarial examples. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 15262–15271, 2021. 6
work page 2021
-
[27]
SugarCrepe: Fixing hackable benchmarks for vision-language compositionality
Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kem- bhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. InProc. Annual Conference on Neural Information Processing Sys- tems (NeurIPS), page 31096–31116, 2023. 2, 6
work page 2023
-
[28]
Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language repre- sentation learning with noisy text supervision. InProc. In- ternational Conference on Machine Learning (ICML), pages 4904–4916, 2021. 2
work page 2021
-
[29]
FineCLIP: Self-distilled region-based CLIP for better fine-grained un- derstanding
Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, and Zhiwu Lu. FineCLIP: Self-distilled region-based CLIP for better fine-grained un- derstanding. InProc. Annual Conference on Neural Infor- mation Processing Systems (NeurIPS), pages 27896–27918,
-
[30]
V o, Patrick Labatut, and Piotr Bo- janowski
Cijo Jose, Th’eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth’ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha”el Ramamonjisoa, Maxime Oquab, Ori- ane Sim’eoni, Huy V . V o, Patrick Labatut, and Piotr Bo- janowski. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. InProc. IEEE/CVF Conference on Comput...
work page 2025
-
[31]
Raphi Kang, Yue Song, Georgia Gkioxari, and Pietro Perona. Is CLIP ideal? no. can we fix it? yes! InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 22436–22446, 2025. 2
work page 2025
-
[32]
3d object representations for fine-grained categorization
Krause, Jonathan, Stark, Michael, Deng, Jia, Fei-Fei, and Li. 3d object representations for fine-grained categorization. InProc. IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013. 6
work page 2013
-
[33]
Learning multiple layers of features from tiny images.Technical Report and University of Tront,
Krizhevsky and Alex. Learning multiple layers of features from tiny images.Technical Report and University of Tront,
-
[34]
VeCLIP: Improving clip training via visual-enriched cap- tions
Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiu- long Shan, Chen-Nee Chuah, Yinfei Yang, and Meng Cao. VeCLIP: Improving clip training via visual-enriched cap- tions. InProc. European Conference on Computer Vision (ECCV), pages 111–127, 2024. 2
work page 2024
-
[35]
Scaling language-image pre- training via masking
Li, Yanghao, Fan, Haoqi, Hu, Ronghang, Feichtenhofer, Christoph, He, and Kaiming. Scaling language-image pre- training via masking. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23390–23400, 2023. 2, 6, 7, 8, 4
work page 2023
-
[36]
Grounded language-image pre-training
Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), 2022. 2
work page 2022
-
[37]
Mask-Adapter: The Devil is in the Masks for Open-V ocabulary Segmentation
Yongkang Li, Tianheng Cheng, Bin Feng, Wenyu Liu, and Xinggang Wang. Mask-Adapter: The Devil is in the Masks for Open-V ocabulary Segmentation. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14998–15008, 2025. 2
work page 2025
-
[38]
Unbiased Region– Language alignment for Open-V ocabulary dense prediction
Yunheng Li, Yuxuan Li, Quan-Sheng Zeng, Wenhai Wang, Qibin Hou, and Ming-Ming Cheng. Unbiased Region– Language alignment for Open-V ocabulary dense prediction. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 23795–23805, 2025. 2
work page 2025
- [39]
-
[40]
Fine-Grained Visual Classification of Aircraft
Maji, Subhransu, Rahtu, Esa, Kannala, Juho, Blaschko, Matthew, Vedaldi, and Andrea. Fine-grained visual classi- fication of aircraft.arXiv preprint arXiv:1306.5151, 2013. 6
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[41]
SLIP: Self-supervision meets language-image pre- training
Norman Mu, Alexander Kirillov, David Wagner, and Sain- ing Xie. SLIP: Self-supervision meets language-image pre- training. InProc. European Conference on Computer Vision (ECCV), pages 529–544, 2022. 2
work page 2022
-
[42]
Open vocabulary semantic segmentation with patch aligned contrastive learning
Mukhoti, Jishnu, Lin, Tsung-Yu, Poursaeed, Omid, Wang, Rui, Shah, Ashish, Torr, Philip H.S., Lim, and Ser-Nam. Open vocabulary semantic segmentation with patch aligned contrastive learning. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2
work page 2023
-
[43]
Automated flower classification over a large number of classes
Nilsback, Maria-Elena, Zisserman, and Andrew. Automated flower classification over a large number of classes. InProc. Indian Conference on Computer Vision and Graphics & Im- age Processing, pages 722–729, 2008. 6
work page 2008
-
[44]
Know “No” better: A data- driven approach for enhancing negation awareness in CLIP
Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know “No” better: A data- driven approach for enhancing negation awareness in CLIP. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, 2025. 2
work page 2025
-
[45]
Parkhi, Omkar M, Vedaldi, Andrea, Zisserman, Andrew, Jawahar, and CV . Cats and dogs. InProc. IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 3498–3505, 2012. 6
work page 2012
-
[46]
TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives
Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives. InProc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 32731–32760, 2024. 1, 2
work page 2024
-
[47]
Seeing what matters: Empowering CLIP with patch generation-to-selection
Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, and Yazhou Yao. Seeing what matters: Empowering CLIP with patch generation-to-selection. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 24862–24872, 2025. 1, 2, 6, 7, 8, 4
work page 2025
-
[48]
Parameter-efficient fine-tuning in hyperspherical space for open-vocabularysemantic segmen- tation
Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaom- ing Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabularysemantic segmen- tation. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15009–15020, 2025. 2
work page 2025
-
[49]
Learn- ing transferable visual models from natural language super- vision
Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, et al. Learn- ing transferable visual models from natural language super- vision. InProc. International Conference on Machine Learn- ing (ICML), pages 8748–8763, 2021. 1, 2, 6, 7, 8, 4, 5
work page 2021
-
[50]
Sam 2: Segment anything in images and videos
Ravi, Nikhila, Gabeur, Valentin, Hu, Yuan-Ting, Hu, Rong- hang, Ryali, Chaitanya, Ma, Tengyu, Khedr, Haitham, R¨adle, Roman, Rolland, Chloe, Gustafson, Laura, Mintun, Eric, Pan, Junting, Alwala, Kalyan Vasudev, Carion, Nicolas, Wu, Chao-Yuan, Girshick, Ross, Doll´ar, Piotr, Feichtenhofer, and Christoph. Sam 2: Segment anything in images and videos. InProc...
work page 2025
-
[51]
Do imagenet classifiers generalize to ima- genet? InProc
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to ima- genet? InProc. International Conference on Machine Learn- ing (ICML), pages 5389–5400, 2019. 6
work page 2019
-
[52]
The german traffic sign recognition bench- mark: a multi-class classification competition
Stallkamp, Johannes, Schlipsing, Marc, Salmen, Jan, Igel, and Christian. The german traffic sign recognition bench- mark: a multi-class classification competition. InIJCNN, pages 1453–1460, 2011. 6
work page 2011
-
[53]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EV A-CLIP: Improved training techniques for CLIP at scale.arXiv preprint arXiv:2303.15389, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Alpha- CLIP: A CLIP model focusing on wherever you want
Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha- CLIP: A CLIP model focusing on wherever you want. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 13019–13029, 2024. 2
work page 2024
-
[55]
Winoground: Probing vision and language models for visio- linguistic compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio- linguistic compositionality. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 5228–5238, 2022. 2, 6
work page 2022
-
[56]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H’enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision- language encoders with improved semantic understanding and localization and and dense fe...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Rotation equivariant cnns for digital pathology
Veeling, Bastiaan S, Linmans, Jasper, Winkens, Jim, Cohen, Taco, Welling, and Max. Rotation equivariant cnns for digital pathology. InMICCAI, pages 210–218, 2018. 6
work page 2018
-
[58]
Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, and Wei Liu. Fix- clip: Dual-branch hierarchical contrastive learning via syn- thetic captions for better understanding of long text. In Proc. IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 20694–20704, 2025. 2
work page 2025
-
[59]
Haohan Wang, Songwei Ge, Zachary C. Lipton, and Eric P. Xing. Learning robust global representations by penaliz- ing local predictive power. InProc. Annual Conference on Neural Information Processing Systems (NeurIPS), pages 13754–13764, 2019. 6
work page 2019
-
[60]
Efficient vision-language pre-training by cluster masking
Wei, Zihao, Pan, Zixuan, Owens, and Andrew. Efficient vision-language pre-training by cluster masking. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26815–26825, 2024. 1, 2, 6, 7, 8, 4
work page 2024
-
[61]
MaskFeat: Masked feature prediction for self-supervised visual pre-training
Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. MaskFeat: Masked feature prediction for self-supervised visual pre-training. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 14668–14678, 2022. 2
work page 2022
-
[62]
Zhixiang Wei, Guangting Wang, Xiaoxiao Ma, Ke Mei, Hua- ian Chen, Yi Jin, and Fengyun Rao. Hq-clip: Leveraging large Vision-Language models to create high-quality image- text datasets and CLIP models. InProc. IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 22447–22456, 2025. 2
work page 2025
-
[63]
Sun database: Large-scale scene recognition from abbey to zoo
Xiao, Jianxiong, Hays, James, Ehinger, Krista A, Oliva, Aude, Torralba, and Antonio. Sun database: Large-scale scene recognition from abbey to zoo. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010. 6
work page 2010
-
[64]
Fg- clip: Fine-grained visual and textual alignment
Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. Fg- clip: Fine-grained visual and textual alignment. InProc. In- ternational Conference on Machine Learning (ICML), 2025. 2
work page 2025
-
[65]
SimMIM: A simple framework for masked image modeling
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9643–9653, 2022. 2
work page 2022
-
[66]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystify- ing CLIP data. InProc. International Conference on Learn- ing Representations (ICLR), 2024. 2
work page 2024
-
[67]
Yang, Yifan, Huang, Weiquan, Wei, Yixuan, Peng, Houwen, Jiang, Xinyang, Jiang, Huiqiang, Wei, Fangyun, Wang, Yin, Hu, Han, Qiu, Lili, et al. Attentive mask clip. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2771–2781, 2023. 1, 2, 6, 7, 8, 4
work page 2023
-
[68]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hocken- maier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descrip- tions.Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 6
work page 2014
-
[69]
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models.Transactions on Machine Learning Research (TMLR), pages 1–20, 2022. 2
work page 2022
-
[70]
When and why vision- language models behave like bags-of-words and and what to do about it? InProc
Mert Y ¨uksekg¨on¨ul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision- language models behave like bags-of-words and and what to do about it? InProc. International Conference on Learning Representations (ICLR), 2023. 2
work page 2023
-
[71]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 2
work page 2023
-
[72]
Long-CLIP: Unlocking the long-text capability of CLIP
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. InProc. European Conference on Computer Vision (ECCV), pages 310–325, 2024. 2
work page 2024
-
[73]
Corrclip: Recon- structing patch correlations in CLIP for open-vocabulary se- mantic segmentation
Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Recon- structing patch correlations in CLIP for open-vocabulary se- mantic segmentation. InProc. IEEE/CVF International Con- ference on Computer Vision (ICCV), pages 24677–24687,
-
[74]
Re- gionCLIP: Region-based Language-Image pretraining
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chun- yuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Re- gionCLIP: Region-based Language-Image pretraining. In Proc. IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 16793–16803, 2022. 2
work page 2022
-
[75]
Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. InProc. European Conference on Computer Vision (ECCV), pages 696–712, 2022. 2 PowerCLIP: Powerset Alignment for Contrastive Pre-Training Supplementary Material Appendix A. Proof of Theorem 1 In this section, we present a proof of Theorem 1. We first restate the definitions of th...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.