BiCLIP: Domain Canonicalization via Structured Geometric Transformation
Recognition: 2 theorem links
Pith reviewed 2026-05-15 14:09 UTC · model grok-4.3
The pith
BiCLIP recovers a canonical geometric transformation from few-shot anchors to align vision-language features across domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BiCLIP shows that disparate visual domains are related by a canonicalized geometric transformation recoverable from a handful of anchor samples; applying the estimated map to multimodal features produces a structured alignment that improves cross-modal similarity and yields state-of-the-art few-shot accuracy on eleven benchmarks while preserving the orthogonality properties predicted by earlier geometric analyses.
What carries the argument
The canonicalized geometric transformation recovered from few-shot anchor samples and applied as a targeted linear map to multimodal features.
If this is right
- Domain adaptation for vision-language models reduces to estimating one low-parameter map instead of fine-tuning millions of weights.
- The same anchor-based procedure can be reused across any pair of domains once the transformation is shown to be stable.
- Verification that the learned maps remain orthogonal supplies direct empirical support for the geometric relation previously derived only between independently trained models.
- Few-shot performance on benchmarks such as EuroSAT, DTD, and FGVCAircraft follows directly from the quality of the estimated alignment rather than from additional model capacity.
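The recovery step these consequences depend on can be sketched in a few lines. As a hedged illustration (the paper's exact estimator is not reproduced here), an orthogonal Procrustes fit recovers a linear map from paired anchor features and, in the noiseless case with enough anchors, recovers it exactly:

```python
import numpy as np

def estimate_anchor_map(src, dst):
    """Orthogonal Procrustes: the orthogonal W minimizing ||src @ W - dst||_F,
    fit from paired anchor features (hypothetical stand-in for the paper's estimator)."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt  # (d, d) orthogonal map

rng = np.random.default_rng(0)
d, n = 8, 32                                       # feature dim, number of anchors
w_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # ground-truth rotation
src = rng.normal(size=(n, d))                      # anchor features in domain A
dst = src @ w_true                                 # same anchors mapped into domain B
w_hat = estimate_anchor_map(src, dst)
print(np.allclose(w_hat, w_true))                  # True: exact recovery, noiseless case
```

With fewer anchors than feature dimensions, the true few-shot regime, the solution is no longer unique, which is exactly where the paper's stability claims would be tested.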
Where Pith is reading between the lines
- If the geometric relation is domain-general, the same recovery step could align modalities other than vision and language without new training objectives.
- The approach implies that many apparent domain gaps are low-rank and therefore correctable by a single linear operator rather than by full representation learning.
- Continued verification of orthogonality across more domain pairs would strengthen the case that canonical transformations are a universal property of independently trained encoders.
Load-bearing premise
Features from different visual domains are related by a single recoverable geometric transformation that can be estimated accurately from only a few labeled examples.
What would settle it
If the estimated transformation matrix is forced to be the identity yet performance on the eleven benchmarks still rises, or if the learned maps lose orthogonality while accuracy remains high, the geometric-alignment account would be falsified.
Original abstract
Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BiCLIP, a framework that estimates a bilinear geometric transformation from few-shot anchor samples to canonicalize image features across domains in vision-language models, thereby improving cross-modal alignment. It reports consistent state-of-the-art results on 11 benchmarks (e.g., EuroSAT, DTD, FGVCAircraft) with an extremely low parameter footprint, and provides post-hoc empirical verification of orthogonality and angular properties in the learned transformations.
Significance. If the central claim holds after addressing the noted gaps, BiCLIP would offer a highly practical, parameter-efficient method for few-shot domain adaptation in VLMs grounded in geometric insights from prior work on canonical transformations. The low-parameter design and verification of structured properties could influence both theory and practice in multimodal learning, provided the geometric structure is shown to be necessary rather than incidental.
major comments (3)
- [Methods] Methods section (transformation estimation procedure): the bilinear parameters are recovered directly from the same few-shot anchor samples subsequently used for evaluation on the benchmarks, creating a circularity that makes it unclear whether reported gains reflect generalization or fitting to the evaluation anchors themselves.
- [Experiments] Experiments section (ablation studies): no controls compare the structured geometric transformation (with orthogonality) against simpler low-parameter alternatives such as scalar scaling, diagonal matrices, or unconstrained low-rank updates; without these, it is impossible to isolate whether the claimed geometric structure drives the SOTA results or whether any low-parameter adaptation would suffice.
- [Analysis] Analysis section (orthogonality verification): the post-hoc confirmation of orthogonality and angular distributions is correlational and does not establish that these properties are load-bearing for the performance gains; the manuscript should include a controlled test (e.g., enforcing vs. relaxing the geometric constraint) to link structure to accuracy.
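To make the second objection concrete, here is a hypothetical tally of the free-parameter budgets such an ablation would compare; the dimension `d` and rank `r` are illustrative choices, not values taken from the paper:

```python
def adapter_param_counts(d: int, r: int) -> dict:
    """Free-parameter counts for the low-parameter baselines the referee
    asks to compare against a full d x d bilinear map (illustrative only)."""
    return {
        "scalar scaling": 1,             # one temperature-like scale
        "diagonal matrix": d,            # per-dimension scale
        f"low-rank (r={r})": 2 * d * r,  # W ~ A @ B with A: d x r, B: r x d
        "full bilinear": d * d,          # unconstrained W
    }

print(adapter_param_counts(d=512, r=4))
```

If the structured map only matches the low-rank baseline at equal parameter count, the geometric story would be incidental rather than load-bearing.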
minor comments (3)
- [Abstract] Abstract: the phrase 'extreme simplicity' would benefit from an explicit statement of the exact parameter count (e.g., number of free parameters in the bilinear map).
- [Results] Results tables: include standard error bars or statistical significance tests across the 11 benchmarks to support the SOTA claims.
- [Methods] Notation: clarify the precise form of the bilinear transformation (e.g., explicit matrix dimensions and any constraints applied during optimization).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications on our experimental setup and commitments to strengthen the manuscript through additional experiments.
Point-by-point responses
-
Referee: [Methods] Methods section (transformation estimation procedure): the bilinear parameters are recovered directly from the same few-shot anchor samples subsequently used for evaluation on the benchmarks, creating a circularity that makes it unclear whether reported gains reflect generalization or fitting to the evaluation anchors themselves.
Authors: We clarify that the bilinear transformation is estimated exclusively from the support set (the few-shot anchor samples provided as input), while all reported metrics are computed on a disjoint query set. This follows the standard protocol for few-shot benchmarks such as those used for EuroSAT, DTD, and FGVCAircraft. The separation between support and query ensures the gains reflect generalization of the estimated transformation rather than direct fitting to evaluation samples. revision: no
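The support/query separation the authors invoke can be sketched as follows; the function name and details are illustrative of the standard few-shot protocol, not the authors' code:

```python
import numpy as np

def support_query_split(labels, k_shot, seed=0):
    """Disjoint per-class support/query split: the map is fit on support
    indices only, metrics are computed on query indices only."""
    rng = np.random.default_rng(seed)
    support, query = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        support.extend(idx[:k_shot])   # k anchors per class
        query.extend(idx[k_shot:])     # held-out evaluation samples
    return np.array(support), np.array(query)

labels = np.repeat(np.arange(3), 10)   # 3 classes, 10 samples each
s, q = support_query_split(labels, k_shot=2)
print(len(s), len(q))                  # 6 24
assert set(s).isdisjoint(q)            # no evaluation sample is an anchor
```

Under this split the circularity concern reduces to whether anchors and queries are drawn from the same benchmark distribution, not whether the same samples are reused.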
-
Referee: [Experiments] Experiments section (ablation studies): no controls compare the structured geometric transformation (with orthogonality) against simpler low-parameter alternatives such as scalar scaling, diagonal matrices, or unconstrained low-rank updates; without these, it is impossible to isolate whether the claimed geometric structure drives the SOTA results or whether any low-parameter adaptation would suffice.
Authors: We agree that the current ablations do not include these direct comparisons. In the revised manuscript we will add controlled experiments evaluating BiCLIP against scalar scaling, diagonal-matrix adaptations, and unconstrained low-rank updates of comparable parameter count to isolate the contribution of the structured bilinear geometric transformation. revision: yes
-
Referee: [Analysis] Analysis section (orthogonality verification): the post-hoc confirmation of orthogonality and angular distributions is correlational and does not establish that these properties are load-bearing for the performance gains; the manuscript should include a controlled test (e.g., enforcing vs. relaxing the geometric constraint) to link structure to accuracy.
Authors: We acknowledge that the existing verification is post-hoc. We will add a controlled ablation in the revised version that directly compares performance when the orthogonality and angular constraints are enforced versus when they are relaxed (e.g., by optimizing an unconstrained bilinear map), thereby linking the geometric structure to the observed accuracy improvements. revision: yes
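One standard way to implement the proposed enforce-vs-relax ablation, offered here only as a sketch, is to project an unconstrained map onto the nearest orthogonal matrix (the polar projection) and compare the two variants:

```python
import numpy as np

def project_orthogonal(w):
    """Polar projection: the orthogonal matrix nearest to W in Frobenius
    norm, one way to 'enforce' the geometric constraint in the ablation."""
    u, _, vt = np.linalg.svd(w)
    return u @ vt

w = np.random.default_rng(2).normal(size=(5, 5))   # unconstrained (relaxed) map
w_orth = project_orthogonal(w)                     # constrained (enforced) map
print(np.allclose(w_orth.T @ w_orth, np.eye(5)))   # True: projection is orthogonal
```

Running the benchmark with `w` versus `w_orth` at identical parameter count would directly test whether the orthogonal structure, rather than mere adaptation capacity, drives the gains.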
Circularity Check
Transformation parameters estimated from same few-shot anchors used for evaluation
specific steps
-
fitted input called prediction
[Abstract]
"Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment."
The transformation parameters are estimated directly from the few-shot anchor samples drawn from the 11 evaluation benchmarks; the SOTA results are then measured using that same fitted transform, so the alignment performance is a direct statistical consequence of the fit on the evaluation data rather than an independent prediction.
full rationale
The paper's core claim is that a canonical geometric transformation recovered from few-shot anchors yields SOTA cross-modal alignment. However, the anchors are the limited labeled samples from the evaluation benchmarks themselves, so the reported gains reduce to fitting a low-parameter transform on the same data used to measure performance. No independent derivation or external validation separates the fit from the result. This matches a fitted-input-called-prediction pattern with partial circularity; the geometric verification is post-hoc on the fitted parameters.
Axiom & Free-Parameter Ledger
free parameters (1)
- Bilinear transformation parameters
axioms (1)
- domain assumption: Independently trained VLMs are related by a canonical transformation that extends to image features across domains
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation... S(i,t)=i W t^T ... upper triangular constraint... orthogonality of the W matrix... normalized Frobenius norm ||W^T W - I||_F /D
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
independently trained VLMs are related by a canonical transformation... structured alignment is the key
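The quantities quoted in the excerpts above can be computed directly. This sketch assumes the plain reading of the excerpt (a bilinear score i W t^T and the normalized Frobenius deviation ||W^T W - I||_F / D), not the paper's implementation:

```python
import numpy as np

def bilinear_similarity(img, w, txt):
    """S(i, t) = i W t^T, the bilinear score quoted from the paper."""
    return img @ w @ txt

def orthogonality_deviation(w):
    """Normalized Frobenius deviation ||W^T W - I||_F / D, the metric
    the paper reportedly uses to verify structured alignment."""
    d = w.shape[0]
    return np.linalg.norm(w.T @ w - np.eye(d), ord="fro") / d

rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # an exactly orthogonal W
img, txt = rng.normal(size=6), rng.normal(size=6)
print(orthogonality_deviation(q) < 1e-9)       # True for an orthogonal map
print(np.isscalar(bilinear_similarity(img, q, txt)))
```

A learned W drifting away from zero on this metric while accuracy holds would be the falsifying signal described above.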
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property.
Reference graph
Works this paper leans on
- [1] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.
- [2] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. PLOT: Prompt learning with optimal transport for vision-language models, 2023.
- [3] Xi Chen, Josip Djolonga, Piotr Conway, Basil Mustafa, Ibrahim Alabdulmohsin, Kasia Rodge, Golnaz Ghiasi, Akshat Shah, Basil Mustafa, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- [4] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
- [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- [6] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
- [7] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters, 2025.
- [8] Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, and Luitpold Staudigl. Domain aligned CLIP for few-shot classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5721–5730, 2024.
- [9] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation, 2022.
- [10] Sharut Gupta, Sanyam Kansal, Stefanie Jegelka, Phillip Isola, and Vikas Garg. Canonicalizing multimodal contrastive representation learning, 2026.
- [11] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
- [12] Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models, 2022.
- [13] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning, 2023.
- [14] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting, 2023.
- [15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
- [16] Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning, 2022.
- [17] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning, 2022.
- [18] Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. Class-agnostic object detection with multi-modal transformer, 2022.
- [19] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [20] Muhammad Arslan Manzoor, Sarah Albarri, Ziting Xian, Zaiqiao Meng, Preslav Nakov, and Shangsong Liang. Multimodality representation learning: A survey on evolution, pretraining and its applications, 2024.
- [21] Joanna Materzynska, Antonio Torralba, and David Bau. Disentangling visual and written concepts in CLIP, 2022.
- [22]
- [23] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
- [24] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
- [25] Mohsen Pourahmadi. Covariance estimation: The GLM and regularization perspectives. Statistical Science, 26(3), 2011.
- [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
- [27] Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, and Alberto Del Bimbo. λ-orthogonality regularization for compatible representation learning, 2025.
- [28] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models, 2022.
- [29] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [30] Vishaal Udandarao, Ankush Gupta, and Samuel Albanie. SuS-X: Training-free name-only transfer of vision-language models, 2023.
- [31] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
- [32] Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment, 2023.
- [33] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. FILIP: Fine-grained interactive language-image pre-training, 2021.
- [34] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision, 2021.
- [35] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning, 2022.
- [36] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.
- [37] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-adapter for better vision-language modeling, 2021.
- [38] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348.
- [39] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models, 2022.
- [40] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision, 2022.