Neutral-Reference Prompting for Vision-Language Models

Senmao Tian; Shunli Zhang; Xiang Wei

arxiv: 2605.15615 · v1 · pith:ISSIX3ZZnew · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-20 19:10 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: vision-language models, few-shot learning, prompting, bias correction, unseen class recognition, base-new trade-off, transfer learning, asymmetric confusion

The pith

Neutral-Reference Prompting corrects pretraining biases in vision-language models by flipping predictions between confusable classes using neutral prompts and reference images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models suffer from a base-new trade-off where gains on unseen classes often come at the cost of known-class accuracy, partly because of persistent asymmetric confusion induced by pretraining. The paper shows that this confusion appears as one class being systematically mistaken for another without the reverse happening, and that neutral text prompts plus reference images can quantify the resulting class-wise prior preferences along the model's inter-class geometry. NeRP combines these priors with the model's sample likelihood to form surrogate scores and then performs a local flip to the confusable alternative whenever the prior dominates but evidence is weak. Experiments across multiple backbones and 15 few-shot and cross-domain benchmarks demonstrate clear gains on unseen classes while known-class performance stays intact. The approach is parameter-free and plug-and-play, addressing the central challenge of improving generalization without retraining.
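To make the notion of asymmetric confusion concrete, the sketch below flags class pairs where A is mistaken for B much more often than B for A, starting from a confusion matrix. It is only an illustration: the function name `asymmetric_pairs`, the thresholds `min_rate` and `ratio`, and the toy matrix are assumptions introduced here, not quantities taken from the paper.

```python
from typing import List, Tuple


def asymmetric_pairs(
    conf: List[List[int]],
    min_rate: float = 0.2,   # hypothetical: minimum A->B error rate to consider
    ratio: float = 3.0,      # hypothetical: how much larger A->B must be than B->A
) -> List[Tuple[int, int, float, float]]:
    """Return (a, b, rate_ab, rate_ba) for pairs confused mostly in one direction.

    conf[t][p] counts samples with true class t predicted as class p.
    """
    n = len(conf)
    totals = [sum(row) for row in conf]
    pairs = []
    for a in range(n):
        for b in range(n):
            if a == b or totals[a] == 0 or totals[b] == 0:
                continue
            rate_ab = conf[a][b] / totals[a]  # share of class a predicted as b
            rate_ba = conf[b][a] / totals[b]  # share of class b predicted as a
            if rate_ab >= min_rate and rate_ab >= ratio * rate_ba:
                pairs.append((a, b, rate_ab, rate_ba))
    return pairs


# Toy matrix (made up): class 0 leaks into class 1 one-sidedly,
# while classes 1 and 2 confuse each other at similar rates.
toy = [
    [60, 38, 2],
    [3, 67, 30],
    [2, 28, 70],
]
flagged = asymmetric_pairs(toy)
```

On the toy matrix, only the pair (0, 1) is flagged; the mutual confusion between classes 1 and 2 is the symmetric case and is left alone.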

Core claim

NeRP is a plug-and-play prompting correction strategy that measures class-wise prior preferences from neutral text prompts and reference images to capture pretraining-induced bias geometry, combines them with sample likelihood to produce surrogate scores, and applies targeted local flips between easily confusable class pairs when the prior strongly favors the current prediction but observed evidence is insufficient, thereby correcting mispredictions on unseen classes without degrading known-class accuracy.

What carries the argument

Neutral-Reference Prompting (NeRP), which measures class-wise prior preferences along the pre-trained inter-class geometry using neutral text prompts and reference images, then combines them with sample likelihood for surrogate scoring and performs local flips on confusable pairs to override prior-dominated errors.
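The summary does not give the exact surrogate score or thresholds, so the following is only a sketch of one plausible reading of the flip rule: subtract a prior offset from the sample-level logit margin and flip to a confusable alternative when the result turns negative. The function name `nerp_style_flip`, the weight `lam`, and the dictionary layouts are assumptions made for illustration, not the authors' formulation.

```python
from typing import Dict, List, Set, Tuple


def nerp_style_flip(
    logits: List[float],
    prior_gap: Dict[Tuple[int, int], float],  # assumed: prior offset favouring i over j
    confusable: Dict[int, Set[int]],          # assumed: confusable alternatives per class
    lam: float = 1.0,                         # hypothetical weight on the prior
) -> int:
    """Return the (possibly flipped) class index for one sample."""
    top = max(range(len(logits)), key=lambda c: logits[c])
    best, best_score = top, 0.0
    for j in sorted(confusable.get(top, set())):
        margin = logits[top] - logits[j]       # data-dependent evidence for top over j
        prior = prior_gap.get((top, j), 0.0)   # prior preference for top over j
        surrogate = margin - lam * prior       # evidence left once the prior is removed
        # Flip only when the prior favours the current prediction and the
        # remaining evidence is negative; keep the most negative candidate.
        if prior > 0 and surrogate < best_score:
            best, best_score = j, surrogate
    return best


# Weak margin (0.2) with a strong prior (0.5) flips from class 0 to class 1.
flipped = nerp_style_flip([2.0, 1.8, 0.1], {(0, 1): 0.5}, {0: {1}})
# A clear margin (1.2) outweighs the same prior, so the prediction stays at 0.
kept = nerp_style_flip([3.0, 1.8, 0.1], {(0, 1): 0.5}, {0: {1}})
```

Restricting the loop to `confusable[top]` keeps the correction local: classes outside the confusable set can never be selected, which is the property the known-class preservation claim depends on.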

If this is right

Accuracy on unseen classes rises substantially across 15 few-shot and cross-domain benchmarks.
Known-class prediction performance remains unchanged after the correction.
The method requires no parameter updates and works as a plug-and-play addition to existing VLMs.
The same correction applies across multiple different model backbones without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The structured nature of the bias suggests that similar neutral-reference probing could diagnose and correct asymmetries in other multimodal models beyond vision-language tasks.
Extending the approach to automatically discover confusable pairs from data rather than predefined classes would allow application in open-vocabulary settings without manual class lists.
The reliance on geometry along inter-class distances points to a broader opportunity for analyzing pretraining biases as measurable geometric properties rather than isolated errors.

Load-bearing premise

The method assumes that prior preferences measured from neutral prompts and reference images accurately reflect the pretraining bias geometry and that flipping between confusable pairs will fix errors without creating new mistakes on either known or unseen classes.

What would settle it

Running NeRP on a new set of few-shot or cross-domain benchmarks and finding either no gain on unseen classes or a drop in known-class accuracy would falsify the claim that the correction reliably improves discrimination without introducing trade-offs.
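The falsification condition above can be written as a small check. This is a sketch only: the function name `contradicts_claim` and the tolerance `tol` (in accuracy points) are hypothetical, and choosing a sensible tolerance would depend on run-to-run variance that the summary does not report.

```python
def contradicts_claim(
    base_before: float,
    base_after: float,
    new_before: float,
    new_after: float,
    tol: float = 0.0,  # hypothetical slack, in accuracy points
) -> bool:
    """True if a benchmark result conflicts with 'gain on unseen, no loss on known'."""
    no_gain_on_new = new_after <= new_before + tol
    drop_on_base = base_after < base_before - tol
    return no_gain_on_new or drop_on_base
```

A result counts against the claim if either condition holds, since the claim is a conjunction of improvement on unseen classes and preservation on known classes.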

Figures

Figures reproduced from arXiv: 2605.15615 by Senmao Tian, Shunli Zhang, Xiang Wei.

**Figure 1.** We observe two distinct types of confusion on downstream data. The first is symmetric confusion: for two similar classes that the model struggles to discriminate, samples from each class are misclassified as the other at comparable rates. The second is asymmetric confusion: the two classes may not be intrinsically ambiguous, yet the model systematically mislabels class A as B far more often than it misla…

**Figure 2.** The pipeline of obtaining prior gaps from the neutral-reference prompt.

2.2. Bias in Pretrained VLMs. Recent work has already studied biases in pretrained VLMs. Hamidieh et al. (2024) proposed the So-BIT vocabulary and showed that CLIP over-associates harmful terms with certain demographics in open-vocabulary retrieval. Alabdulmohsin et al. (2024) introduced Multi-Mo…

**Figure 3.** The t-SNE visualization of embeddings, and a comparison of the between-class variance of per-class mean logits.

**Figure 4.** An explanation of how NeRP corrects mispredictions.

Prior-dominated inconsistency. For a test image $x$ with top-1 prediction $\hat{y}(x) = i$, and any $j \in A(i)$, recall the sample-level margin $m_{ij}(x) = \ell_i(x) - \ell_j(x)$. We interpret $\Sigma_{i,j}(D)$ as a prior logit-style offset for $i$ against $j$, and $m_{ij}(x)$ as the data-dependent logit margin. $m_{ij}(x)$ comes from L2-normalized cosine similarities between embeddings, and und…

**Figure 6.** (a) Performance of different K. (b) Few-shot performance on unseen classes.
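The sample-level margin quoted in the Figure 4 excerpt is a difference of two class logits built from L2-normalized cosine similarities. A minimal sketch of that computation follows, using plain Python lists in place of real encoder outputs; the helper names and the `scale` parameter are assumptions for illustration, and the toy vectors are made up.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_logits(
    image_emb: List[float],
    text_embs: List[List[float]],
    scale: float = 1.0,  # hypothetical logit scale; 1.0 gives raw cosine similarity
) -> List[float]:
    """One logit per class: scaled cosine similarity between image and class text."""
    img = l2_normalize(image_emb)
    return [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]


def margin(logits: List[float], i: int, j: int) -> float:
    """Sample-level margin of class i over class j."""
    return logits[i] - logits[j]


# Toy 2-D embeddings: the image leans towards the second class direction.
toy_logits = cosine_logits([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
toy_margin = margin(toy_logits, 1, 0)
```

Because each cosine similarity lies in [-1, 1], the unscaled margin is confined to [-2, 2], so a prior offset of comparable size can outweigh the evidence from a single sample.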

Original abstract

Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NeRP gives a practical parameter-free way to measure and counter asymmetric confusion in VLMs using neutral prompts, but the local flip rule's safety on base classes is the part that still needs tighter checks.

The main takeaway is that this paper offers NeRP as a plug-and-play prompting correction for the base-new tradeoff in vision-language models. It measures class-wise prior preferences from neutral text prompts and reference images, combines them with sample likelihood into a surrogate score, and flips predictions between easily confusable pairs when the prior dominates but evidence looks weak. The reported result is better accuracy on unseen classes without degrading known-class performance across multiple backbones and 15 benchmarks.

Referee Report

3 major / 2 minor

Summary. The paper proposes NeRP, a plug-and-play prompting correction for vision-language models addressing the Base-New Trade-off. It identifies asymmetric confusion patterns in VLMs, measures class-wise prior preferences using neutral text prompts and reference images, combines these with sample likelihood into a surrogate score, and applies local flips between easily confusable pairs when the prior dominates but evidence is insufficient. This is claimed to boost unseen-class accuracy while preserving known-class performance, with results reported across multiple backbones and 15 few-shot and cross-domain benchmarks.

Significance. If the mechanism holds, NeRP offers a practical, tuning-free correction for pretraining biases in VLM transfer, which is a persistent issue. The extensive benchmark coverage and focus on asymmetric confusion as a distinct phenomenon from simple overfitting are strengths. The approach could influence prompting strategies if the flip rule proves robust without introducing new errors.

major comments (3)

[§3 (Method)] The quantitative criterion for 'clearly insufficient' evidence and the exact formula for the surrogate score (prior combined with likelihood) are not provided with explicit thresholds or equations; this is load-bearing because the decision to flip depends on it, and without it the claim that flips correct mispredictions without new base-class errors cannot be verified or reproduced.
[§4 (Experiments, cross-domain results)] No ablation or analysis shows that confusable pairs identified from neutral prompts remain valid when domain shift alters inter-class geometry; this directly risks violating the preservation of known-class performance, as a flip tuned on source-like priors could increase base-class error rates.
[Table of results (likely Table 2 or 3)] The manuscript reports broad gains but provides no error bars, standard deviations, or statistical tests across the 15 benchmarks; without these, the 'substantially improves' claim on unseen classes while 'preserving' base performance lacks the rigor needed to support the central no-trade-off assertion.
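One way the pair-stability concern in the second major comment could be probed is to compare the sets of confusable pairs found on two domains. The sketch below uses Jaccard overlap for that comparison; this metric is a suggestion made here, not an analysis from the paper, and the function name and example pairs are hypothetical.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def pair_overlap(source_pairs: Iterable[Pair], target_pairs: Iterable[Pair]) -> float:
    """Jaccard overlap between two sets of directed confusable pairs (1.0 = identical)."""
    a, b = set(source_pairs), set(target_pairs)
    if not a and not b:
        return 1.0  # two empty sets are treated as fully consistent
    return len(a & b) / len(a | b)


# Made-up example: one of the two source-domain pairs survives the shift.
overlap = pair_overlap({(0, 1), (2, 3)}, {(0, 1), (4, 5)})
```

A low overlap would suggest that flips derived from one domain's priors are being applied to pairs that are no longer confusable after the shift, which is the failure mode the referee describes.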

minor comments (2)

[Abstract] The phrase '15 few-shot and cross-domain benchmarks' is used without naming the specific datasets or providing a high-level breakdown; a short table or list in the abstract or introduction would improve readability.
[Notation] 'Surrogate score' is referenced without a numbered equation defining its computation from prior and likelihood; adding Eq. (X) would clarify the combination step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of NeRP's potential impact. We address each major comment point by point below, providing clarifications and committing to targeted revisions that strengthen reproducibility and rigor without altering the core claims or results.

Referee: [§3 (Method)] The quantitative criterion for 'clearly insufficient' evidence and the exact formula for the surrogate score (prior combined with likelihood) are not provided with explicit thresholds or equations; this is load-bearing because the decision to flip depends on it, and without it the claim that flips correct mispredictions without new base-class errors cannot be verified or reproduced.

Authors: We agree that explicit equations and thresholds are necessary for full reproducibility and verification of the no-new-errors claim. While §3 describes the surrogate score as the combination of neutral-prompt priors with sample likelihood and the flip condition as occurring when the prior strongly dominates but evidence is insufficient, the precise formulation was presented in prose. In the revision we will add the explicit equations for the surrogate score and the quantitative thresholds used for the 'clearly insufficient' criterion, along with pseudocode for the local flip decision. This will enable direct verification that the mechanism corrects mispredictions on confusable pairs without introducing base-class errors.

Revision: yes

Referee: [§4 (Experiments, cross-domain results)] No ablation or analysis shows that confusable pairs identified from neutral prompts remain valid when domain shift alters inter-class geometry; this directly risks violating the preservation of known-class performance, as a flip tuned on source-like priors could increase base-class error rates.

Authors: We acknowledge the value of an explicit ablation on pair stability under domain shift. Our cross-domain benchmark results already demonstrate that base-class accuracy is preserved after applying NeRP, which provides indirect evidence that the neutral-prompt-derived confusable pairs remain effective. To directly address the concern, we will add a targeted ablation in the revised §4 that examines how the identified pairs and flip decisions behave across domain shifts, confirming that no increase in base-class error occurs.

Revision: yes

Referee: [Table of results (likely Table 2 or 3)] The manuscript reports broad gains but provides no error bars, standard deviations, or statistical tests across the 15 benchmarks; without these, the 'substantially improves' claim on unseen classes while 'preserving' base performance lacks the rigor needed to support the central no-trade-off assertion.

Authors: We agree that variability measures and statistical tests would increase the rigor of the central no-trade-off claim. In the revised manuscript we will augment the result tables with standard deviations computed over multiple runs and include statistical significance tests (e.g., paired t-tests) for the reported gains on unseen classes and the preservation of base-class performance across the 15 benchmarks.

Revision: yes
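The paired t-test mentioned in the last response can be sketched with the standard library alone. The accuracy values below are invented purely to show the arithmetic and are not results from the paper; the function name `paired_t` is likewise an assumption.

```python
import math
import statistics
from typing import List, Tuple


def paired_t(before: List[float], after: List[float]) -> Tuple[float, float, float]:
    """Return (mean difference, sample std of differences, t statistic).

    Each position pairs the same benchmark before and after the correction.
    The statistic has len(before) - 1 degrees of freedom.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized lists with at least 2 entries")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd / math.sqrt(len(diffs)))
    return mean_diff, sd, t_stat


# Invented unseen-class accuracies on four benchmarks, for illustration only.
toy_before = [70.0, 65.0, 80.0, 75.0]
toy_after = [72.0, 68.0, 81.0, 79.0]
mean_diff, sd, t_stat = paired_t(toy_before, toy_after)
```

The same function applied to base-class accuracies would test the preservation half of the claim, although a non-significant difference there is weaker evidence than an explicit equivalence test.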

Circularity Check

0 steps flagged

No significant circularity: NeRP is a heuristic correction derived from external measurements

The paper's derivation begins from an empirical observation of asymmetric confusion on downstream data and defines NeRP as a plug-and-play rule that measures class-wise prior preferences from neutral text prompts plus reference images (independent of target labels), forms a surrogate score by combining those priors with sample likelihood, and applies a local flip only when the prior dominates while evidence is insufficient. No equation reduces the claimed accuracy gain on new classes to a fitted parameter on the evaluation data itself, nor does any load-bearing step collapse to a self-citation or ansatz imported from the authors' prior work. The method remains a post-hoc correction whose correctness is tested on held-out benchmarks rather than being true by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated assumption that neutral prompts and reference images produce a faithful estimate of pretraining bias geometry that can be safely used for correction; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Asymmetric confusion observed on downstream data reflects a persistent pretraining-induced bias that can be measured via neutral prompts.
Invoked to justify why the prior should override weak sample evidence for unseen classes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 6 internal anchors

[1] Alabdulmohsin, I., Wang, X., Steiner, A., Goyal, P., D’Amour, A., and Zhai, X. Clip the bias: How useful is balancing data in multimodal learning? arXiv preprint arXiv:2403.04547.

[2] Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446–461. Springer, 2014.

[3] Chen, Y., Habibian, A., Benini, L., and Li, Y. Gated relational alignment via confidence-based distillation for efficient VLMs. arXiv preprint arXiv:2601.22709.

[4] Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178. IEEE, 2004.

[5] Ghate, K., Slaughter, I., Wilson, K., Diab, M., and Caliskan, A. Intrinsic bias is predicted by pretraining data and correlates with downstream performance in vision-language encoders. arXiv preprint arXiv:2502.07957.

[6] He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366.

[7] Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., and Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.

[8] Lu, L., Chen, X., Guo, M., Li, S., Wang, J., and Shi, Y. Chordedit: One-step low-energy transport for image editing. arXiv preprint arXiv:2602.19083.

[9] Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

[10] Sahili, Z. A., Patras, I., and Purver, M. A comprehensive social bias audit of contrastive vision language models. arXiv preprint arXiv:2501.13223.

[11] Shi, Y., Xie, Y., Guo, M., Lu, L., Huang, M., Wang, J., Zhu, Z., Xu, B., and Huang, Z. MMErroR: A benchmark for erroneous reasoning in vision-language models. arXiv preprint arXiv:2601.03331.

[12] Soomro, K., Zamir, A., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. ArXiv, abs/1212.0402.

[13] Xi, G., Tian, Y., Yang, M., Yi, H., Lin, L., Hao, X., Wang, K., and Wang, W. Large vision-language models get lost in attention. arXiv preprint arXiv:2605.05668.

[14] Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383.

[15] Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783.
