Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

Antoni B. Chan; Chenyang Zhao; Janet H. Hsiao; Kun Wang

arxiv: 2502.18816 · v2 · submitted 2025-02-26 · 💻 cs.CV

Chenyang Zhao, Kun Wang, Janet H. Hsiao, Antoni B. Chan

Pith reviewed 2026-05-23 02:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords: CLIP, interpretability, gradient explanation, heat maps, vision language model, fine-tuning, attribution

The pith

Grad-ECLIP generates heat maps for CLIP by weighting token features with gradient-derived channel and spatial weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Grad-ECLIP to interpret CLIP's image-text matching decisions. It decomposes the encoders to connect the similarity score to intermediate features, then applies gradient-based weights to those features to produce heat maps for images and text. Unlike earlier Transformer interpretation methods, it does not build on self-attention maps, which are typically extremely sparse in CLIP. The authors report qualitative and quantitative evaluations in which it outperforms prior techniques. The maps also support analysis of CLIP and an application that improves fine-grained alignment during fine-tuning.

Core claim

By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results by applying channel and spatial weights on token features.

What carries the argument

Gradient-derived channel and spatial weights applied to token features in CLIP encoders to generate visual and textual explanation heat maps.
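To make the weighting concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the heat map has the form ReLU(spatial weight × Σ over channels of channel weight × token feature), which is one reading of the description above. The function name `weighted_token_heatmap`, the toy "encoder", and all numbers are hypothetical. In the toy, the pooled feature is a spatially weighted sum of tokens and the similarity is a dot product with a text embedding, so the gradient of the similarity with respect to the pooled feature is simply the text embedding.

```python
def weighted_token_heatmap(token_feats, channel_w, spatial_w):
    """Score each token as ReLU(spatial_w[i] * sum_c channel_w[c] * token_feats[i][c])."""
    scores = []
    for feats, s_w in zip(token_feats, spatial_w):
        weighted = s_w * sum(w * f for w, f in zip(channel_w, feats))
        scores.append(max(weighted, 0.0))  # keep positive evidence only
    return scores


# Toy setup (hypothetical numbers): 3 tokens with 2 channels each.
token_feats = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
spatial_w = [0.5, 0.25, 0.25]  # assumed per-token weights, summing to 1
text_emb = [2.0, -1.0]         # assumed text embedding

# Toy "encoder": the pooled feature is the spatially weighted sum of tokens,
# and the similarity is its dot product with the text embedding.
pooled = [sum(s * f[c] for s, f in zip(spatial_w, token_feats)) for c in range(2)]
similarity = sum(p * t for p, t in zip(pooled, text_emb))

# For this linear toy, d(similarity)/d(pooled) equals text_emb,
# so the gradient-derived channel weights are just text_emb.
channel_w = text_emb
heatmap = weighted_token_heatmap(token_feats, channel_w, spatial_w)
print(similarity, heatmap)
```

In this linear toy the signed per-token scores (1.0, -0.5, 0.25) sum to the similarity itself (0.75), and the ReLU then discards the negative contribution. Real CLIP encoders are nonlinear, so no such exact decomposition should be assumed.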

If this is right

Reveals the working mechanism of image-text matching in CLIP.
Identifies strengths and limitations in CLIP's attribution identification.
Explores the relationship between a word's concreteness and its usage in CLIP.
Boosts fine-grained alignment in CLIP fine-tuning using text-specific saliency regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gradient weighting could improve explanations in other multimodal models.
The method might enable better debugging of vision-language models.
Explanations could be integrated into training to enforce better alignment automatically.

Load-bearing premise

That applying gradient-based channel and spatial weights to token features yields more faithful, higher-quality explanations than attention maps, which are sparse in CLIP.

What would settle it

The claim would be undercut if quantitative metrics such as insertion or deletion scores, or agreement with human annotations, showed no improvement over state-of-the-art attention-based methods on CLIP explanation tasks.
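As an illustration of what a deletion score measures, the following Python sketch removes input elements in order of decreasing saliency and tracks how fast a model score falls. It is a generic version of the protocol, not the paper's evaluation code: `deletion_curve`, `area_under_curve`, the zero baseline, and the linear `toy_score` model are all assumptions made for the example. A faithful saliency ranking should give a lower area under the deletion curve than a poor one.

```python
def deletion_curve(score_fn, inputs, saliency, baseline=0.0):
    """Replace inputs with `baseline`, most salient first, recording the score each step."""
    order = sorted(range(len(inputs)), key=lambda i: saliency[i], reverse=True)
    current = list(inputs)
    scores = [score_fn(current)]
    for idx in order:
        current[idx] = baseline
        scores.append(score_fn(current))
    return scores


def area_under_curve(scores):
    """Trapezoidal area with the deletion fraction normalised to [0, 1]."""
    steps = len(scores) - 1
    return sum((a + b) / 2 for a, b in zip(scores, scores[1:])) / steps


# Hypothetical linear "model": each input contributes weight * value.
weights = [0.5, 2.0, 1.0, 0.25]


def toy_score(values):
    return sum(w * v for w, v in zip(weights, values))


inputs = [1.0, 1.0, 1.0, 1.0]
faithful_saliency = list(weights)          # ranks inputs by true contribution
reversed_saliency = [-w for w in weights]  # ranks them in the worst order

auc_faithful = area_under_curve(deletion_curve(toy_score, inputs, faithful_saliency))
auc_reversed = area_under_curve(deletion_curve(toy_score, inputs, reversed_saliency))
print(auc_faithful, auc_reversed)
```

The insertion variant runs the same loop in the opposite direction, starting from the baseline and restoring elements most salient first, where a higher area is better.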

Figures

Figures reproduced from arXiv: 2502.18816 by Antoni B. Chan, Chenyang Zhao, Janet H. Hsiao, Kun Wang.

**Figure 2.** Illustration of Grad-ECLIP. An image-text pair specific visual

**Figure 3.** Comparison of heat maps from: (a) the raw self-attention map in the last ViT layer; (b) Rollout [14]; (c) Grad-CAM [20]; (d) GAME

**Figure 4.** Explanations for image-text pairs from MS COCO using: by (a) raw self-attention; Transformer interpretation methods (b) Rollout, (c)

**Figure 5.** Comparison of visual explanations for different methods on image samples from different image domains: natural images (ImageNet),

**Figure 6.** Grad-ECLIP visual explanations of the ViT classifier and BLIP.

**Figure 7.** Visual explanations of the CNN-based CLIP with ResNet50

**Figure 8.** Classification accuracy at Top-1 vs. (a) Deletion steps

**Figure 10.** The image visual explanations generated when aggregating over N layers of the image transformer encoder. Panels: N = 1, 2, 4, 6, 8, 10, 12.

**Figure 13.** Visual explanation heat maps generated for single words and

**Figure 14.** Visual explanations on image matching with different kinds

**Figure 15.** The average word importance (via Grad-ECLIP) vs. word

**Figure 16.** The average word importance (via Grad-ECLIP) vs. the word frequency in the OpenWordText dataset for the top-1000 most frequent

**Figure 17.** Overview of the proposed fine-grained fine-tuning of CLIP

Original abstract

Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analysis are conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of explanation map that indicates text-specific saliency region of input image, we also propose an application with Grad-ECLIP, which is adopted to boost the fine-grained alignment in the CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Grad-ECLIP gives a direct gradient-weighting method for CLIP explanations that sidesteps sparse attention maps and adds a fine-tuning application, but the superiority claims rest on evaluations that skip standard faithfulness checks.

Grad-ECLIP introduces gradient-derived channel and spatial weights applied to token features to produce heatmaps for image regions and words in CLIP similarity scores. This is the main new piece: it decomposes the encoder to connect matching similarity to intermediate features and explicitly avoids self-attention because those maps are too sparse in CLIP. The paper also turns the maps into an application that uses text-specific saliency to improve fine-grained alignment during fine-tuning, and the code is released on GitHub. That practical extension and the analysis of word concreteness versus abstractness are the parts that stand out as useful beyond the core method.

The evaluations are described as both qualitative and quantitative with claims of better results than prior work. The approach itself is technically straightforward and does not appear circular. The soft spot is the evaluation. The abstract does not mention standard faithfulness tests such as insertion/deletion AUC or pointing games against human annotations, so the claim that these weighted maps are more faithful than attention-based alternatives stays unverified. If the full paper uses only custom or post-hoc metrics without controlled synthetic cases, that weakens the quantitative support.

This is for researchers working on interpretability tools for vision-language models or on CLIP fine-tuning. A reader who needs a concrete way to generate region or word attributions could extract the weighting scheme or the alignment application. It deserves peer review because the idea is clear, the code is available, and the application shows engagement with a usable problem, even if referees will likely ask for tighter faithfulness metrics.

Referee Report

2 major / 2 minor

Summary. The paper proposes Grad-ECLIP, a gradient-based method for generating visual and textual explanations of CLIP image-text matching scores. It decomposes the CLIP encoder to relate similarity to intermediate spatial features, then applies channel and spatial weights derived from gradients to token features (instead of relying on sparse self-attention maps) to produce heatmaps; qualitative/quantitative evaluations claim superiority over prior methods, followed by analyses of CLIP's attribution behavior and an application that uses the maps to improve fine-grained alignment during CLIP fine-tuning. Code is released.

Significance. If the heatmaps can be shown to be faithful, the work would address an important gap in VLM interpretability by providing an alternative to attention-based explanations and could support better understanding and fine-tuning of CLIP-like models; the open-source code is a positive contribution to reproducibility.

major comments (2)

[§4] §4 (Quantitative Evaluation): The reported superiority over attention-based and other explanation methods rests on unspecified quantitative metrics; standard faithfulness protocols (insertion/deletion AUC, pointing game against human annotations, or controlled synthetic pairs) are not mentioned, leaving open whether the channel/spatial weighting isolates influential regions/words or produces artifacts.
[§3] §3 (Method): The central claim that gradient-derived channel and spatial weights on token features yield higher-quality explanations than sparse attention maps requires an ablation isolating the contribution of the weighting scheme versus raw gradients or attention; without it, the advantage over prior Transformer interpretation methods is not load-bearing.
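For readers unfamiliar with the pointing game named in the first major comment, here is a minimal Python sketch of the idea: an explanation scores a hit when the maximum of its heat map falls inside the human-annotated region. This is a simplified illustration, not any paper's evaluation code. The function names and the tiny 3×3 maps are hypothetical, and the pixel tolerance that some published variants allow around the maximum is omitted.

```python
def pointing_game_hit(heatmap, mask):
    """Return True if the heat map's maximum lies inside the annotated mask."""
    best_value, best_pos = float("-inf"), None
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (r, c)
    r, c = best_pos
    return bool(mask[r][c])


def pointing_game_accuracy(heatmaps, masks):
    """Fraction of (heat map, mask) pairs that score a hit."""
    hits = [pointing_game_hit(h, m) for h, m in zip(heatmaps, masks)]
    return sum(hits) / len(hits)


# Hypothetical 3x3 example: the annotated object occupies the lower-right 2x2 block.
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 1, 1]]
heat_inside = [[0.1, 0.2, 0.0],
               [0.0, 0.9, 0.3],
               [0.0, 0.2, 0.1]]   # peak at (1, 1), inside the mask
heat_outside = [[0.8, 0.2, 0.0],
                [0.0, 0.3, 0.3],
                [0.0, 0.2, 0.1]]  # peak at (0, 0), outside the mask

accuracy = pointing_game_accuracy([heat_inside, heat_outside], [mask, mask])
print(accuracy)
```

Because only the single maximum is tested, the pointing game measures localisation rather than the full shape of the map, which is why it is usually reported alongside perturbation metrics such as insertion and deletion.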

minor comments (2)

[Abstract, §5] Abstract and §5: The application section claims the explanation maps boost fine-grained alignment, but the quantitative improvement (e.g., retrieval or alignment scores before/after) should be reported with error bars for clarity.
[§3] Notation: The decomposition relating matching similarity to intermediate features would benefit from an explicit equation showing how the final similarity score is expressed in terms of the weighted token features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments and indicate the revisions planned for the manuscript.

Referee: [§4] §4 (Quantitative Evaluation): The reported superiority over attention-based and other explanation methods rests on unspecified quantitative metrics; standard faithfulness protocols (insertion/deletion AUC, pointing game against human annotations, or controlled synthetic pairs) are not mentioned, leaving open whether the channel/spatial weighting isolates influential regions/words or produces artifacts.

Authors: We agree that the quantitative metrics should be more explicitly described and that standard faithfulness protocols would strengthen the evaluation. In the revised manuscript, we will specify the metrics used in §4 and incorporate insertion/deletion AUC as well as pointing game evaluations against human annotations to better validate the faithfulness of the explanations.

revision: yes

Referee: [§3] §3 (Method): The central claim that gradient-derived channel and spatial weights on token features yield higher-quality explanations than sparse attention maps requires an ablation isolating the contribution of the weighting scheme versus raw gradients or attention; without it, the advantage over prior Transformer interpretation methods is not load-bearing.

Authors: We recognize that an ablation study would help isolate the effect of the channel and spatial weighting. Although the current manuscript includes comparisons to attention-based methods, we will add a dedicated ablation in the revised version comparing Grad-ECLIP to variants using only raw gradients and only attention maps to substantiate the contribution of the weighting scheme.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained gradient-based method

The paper's core derivation decomposes the CLIP encoder to relate similarity scores to intermediate spatial features and applies gradient-derived channel/spatial weights to token features. This is a direct methodological construction presented as an alternative to sparse attention maps, with no self-citations invoked as uniqueness theorems, no parameters fitted to data then relabeled as predictions, and no equations that reduce to their own inputs by definition. Evaluations are claimed to verify effectiveness independently of the method's construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on standard gradient computation and feature weighting; no free parameters, axioms, or invented entities are introduced beyond the existing CLIP architecture and standard attribution techniques.

Reference graph

Works this paper leans on

88 extracted references · 12 canonical work pages · 6 internal anchors

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
[2] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in CVPR, 2021, pp. 3558–3568
[3] J. Cha, K. Lee, S. Park, and S. Chun, “Domain generalization by mutual-information regularization with pre-trained models,” in ECCV, 2022
[4] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, “Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning,” Neurocomputing, vol. 508, pp. 293–304, 2022
[5] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu, “Cris: Clip-driven referring image segmentation,” in CVPR, 2022
[6] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in CVPR, 2022
[7] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv:2205.01917, 2022
[8] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022, pp. 12 888–12 900
[9] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, vol. 130, no. 9, pp. 2337–2348, 2022
[10] G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang, “Prompt learning with optimal transport for vision-language models,” arXiv:2210.01253, 2022
[11] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in ECCV, 2020
[12] J. Wang, P. Zhou, M. Z. Shou, and S. Yan, “Position-guided text prompt for vision-language pre-training,” in CVPR, 2023, pp. 23 242–23 251
[13] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in CVPR, 2022, pp. 16 793–16 803
[14] S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” in ACL, 2020, pp. 4190–4197
[15] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” in CVPR, 2021
[16] ——, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” in ICCV, 2021, pp. 397–406
[17] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10(7):e0130140, 2015
[18] Y. Li, H. Wang, Y. Duan, and X. Li, “Clip surgery for better explainability with enhancement in open-vocabulary tasks,” arXiv:2304.05653, 2023
[19] C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in ECCV. Springer, 2022, pp. 696–712
[20] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626
[21] V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,” arXiv:1806.07421, 2018
[22] Y. Wang, T. G. Rudner, and A. G. Wilson, “Visual explanations of image-text representations via multi-modal information bottleneck attribution,” NeurIPS, 2024
[23] Y. Li, H. Wang, Y. Duan, H. Xu, and X. Li, “Exploring visual interpretability for contrastive language-image pre-training,” arXiv:2209.07046, 2022
[24] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833
[25] P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y. Wei, “Layer-cam: Exploring hierarchical class activation maps for localization,” TIP, vol. 30, pp. 5875–5888, 2021
[26] S. Srinivas and F. Fleuret, “Full-gradient representation for neural network visualization,” in NeurIPS, vol. 32, 2019
[27] D. Kim, A. Angelova, and W. Kuo, “Region-aware pretraining for open-vocabulary object detection with vision transformers,” in CVPR, 2023
[28] S. Wu, W. Zhang, L. Xu, S. Jin, X. Li, W. Liu, and C. C. Loy, “Clipself: Vision transformer distills itself for open-vocabulary dense prediction,” arXiv preprint arXiv:2310.01403, 2023
[29] C. Zhao, K. Wang, X. Zeng, R. Zhao, and A. B. Chan, “Gradient-based visual explanation for transformer-based clip,” in ICML, 2024
[30] Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, and J. Shao, “Camp: Cross-modal adaptive message passing for text-image retrieval,” in ICCV, 2019
[31] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 2048–2057
[32] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015
[33] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in ICCV, 2015
[34] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in WACV, 2018, pp. 839–847
[35] H. G. Ramaswamy et al., “Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization,” in WACV, 2020
[36] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in CVPR Workshops, 2020, pp. 24–25
[37] H. Wang, R. Naidu, J. Michael, and S. S. Kundu, “Ss-cam: Smoothed score-cam for sharper visual feature localization,” arXiv:2006.14255, 2020
[38] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust you?’ explaining the predictions of any classifier,” in ACM SIGKDD, 2016
[39] R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in ICCV, 2017, pp. 3429–3437
[40] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” NeurIPS, vol. 30, 2017
[41] J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiedemer, and S. Behnke, “Interpretable and fine-grained visual explanations for convolutional neural networks,” in CVPR, 2019, pp. 9097–9107
[42] J. Lee, J. Yi, C. Shin, and S. Yoon, “Bbam: Bounding box attribution map for weakly supervised semantic and instance segmentation,” in CVPR, 2021
[43] V. Petsiuk, R. Jain, V. Manjunatha, V. I. Morariu, A. Mehra, V. Ordonez, and K. Saenko, “Black-box explanation of object detectors via saliency maps,” in CVPR, 2021, pp. 11 443–11 452
[44] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep taylor decomposition,” Pattern recognition, vol. 65, pp. 211–222, 2017
[45] W.-J. Nam, S. Gur, J. Choi, L. Wolf, and S.-W. Lee, “Relative attributing propagation: Interpreting the comparative contributions of individual units in deep neural networks,” in AAAI, 2020
[46] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in ICML, 2017
[47] J. Gu, Y. Yang, and V. Tresp, “Understanding individual decisions of cnns via contrastive backpropagation,” in ACCV, 2019, pp. 119–134
[48] B. K. Iwana, R. Kuroki, and S. Uchida, “Explaining convolutional neural networks using softmax gradient layer-wise relevance propagation,” in ICCVW, 2019
[49] Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, and D. Zhu, “Attcat: Explaining transformers via attentive class activation tokens,” NeurIPS, 2022
[50] W. Xie, X.-H. Li, C. C. Cao, and N. L. Zhang, “Vit-cx: Causal explanation of vision transformers,” pp. 1569–1577, 2023
[51] L. Yu and W. Xiang, “X-pruner: explainable pruning for vision transformers,” in CVPR, 2023, pp. 24 355–24 363
[52] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai, “A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model,” ECCV, 2022
[53] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in CVPR, 2023, pp. 7061–7070
[54] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” ICLR, 2022
[55] Y. Du, F. Wei, Z. Zhang, M. Shi, Y. Gao, and G. Li, “Learning to prompt for open-vocabulary object detection with vision-language model,” in CVPR, 2022, pp. 14 084–14 093
[56] W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova, “F-vlm: Open-vocabulary object detection upon frozen vision and language models,” in ICLR, 2023
[57] X. Wu, F. Zhu, R. Zhao, and H. Li, “Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching,” in CVPR, 2023, pp. 7031–7040
[58] Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” NeurIPS, vol. 36, 2024
[59] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in CVPR, 2022, pp. 10 965–10 975
[60] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in ECCV, 2020
[61] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, “Vinvl: Revisiting visual representations in vision-language models,” in CVPR, 2021, pp. 5579–5588
[62] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023
[63] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” IJCV, vol. 123, pp. 32–73, 2017
[64] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” NeurIPS, vol. 28, 2015
[65] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778
[66] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, pp. 211–252, 2015
[67]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014

2014
[68]

The many faces of robustness: A critical analysis of out-of-distribution generalization,

D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al. , “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in ICCV, 2021, pp. 8340–8349

2021
[69]

Learning robust global representations by penalizing local predictive power,

H. Wang, S. Ge, Z. Lipton, and E. P. Xing, “Learning robust global representations by penalizing local predictive power,” NeurIPS, 2019

2019
[70]

Natural adversarial examples,

D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in CVPR, 2021, pp. 15 262–15 271

2021
[71]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,

P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in ACL, 2018, pp. 2556–2565

2018
[72]

Making the most of text semantics to improve biomedical vision–language processing,

B. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle et al., “Making the most of text semantics to improve biomedical vision–language processing,” in ECCV, 2022

2022
[73]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

2020
[74]

Align before fuse: Vision and language representation learning with momentum distillation,

J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” NeurIPS, vol. 34, pp. 9694–9705, 2021

2021
[75]

Evaluating the visualization of what a deep neural network has learned,

W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, “Evaluating the visualization of what a deep neural network has learned,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28(11), pp. 2660–73, 2016

2016
[76]

Top-down neural attention by excitation backprop,

J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” IJCV, vol. 126(10), pp. 1084–102, 2018

2018
[77]

Odam: Gradient-based instance-specific visual explanation for object detection,

C. Zhao and A. B. Chan, “Odam: Gradient-based instance-specific visual explanation for object detection,” in ICLR, 2022

2022
[78]

Large-scale unsupervised semantic segmentation,

S. Gao, Z.-Y. Li, M.-H. Yang, M.-M. Cheng, J. Han, and P. Torr, “Large-scale unsupervised semantic segmentation,” TPAMI, 2022

2022
[79]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,

J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in CVPR, 2017

2017
[80]

Concreteness ratings for 40 thousand generally known english word lemmas,

M. Brysbaert, A. B. Warriner, and V. Kuperman, “Concreteness ratings for 40 thousand generally known english word lemmas,” Behavior research methods, vol. 46, pp. 904–911, 2014

2014
