arxiv: 2502.18816 · v2 · submitted 2025-02-26 · 💻 cs.CV

Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

Pith reviewed 2026-05-23 02:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords CLIP · interpretability · gradient explanation · heat maps · vision language model · fine-tuning · attribution

The pith

Grad-ECLIP generates heat maps for CLIP by weighting token features with gradient-derived channel and spatial weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Grad-ECLIP to interpret CLIP's image-text matching decisions. It decomposes the encoders to connect similarity scores to intermediate features and applies gradient-based weights to produce heat maps for images and text. This differs from earlier Transformer interpretation methods, which rely on self-attention maps that are typically extremely sparse in CLIP. The paper reports qualitative and quantitative evaluations in which Grad-ECLIP produces higher-quality explanations than prior techniques. The maps also support analysis of CLIP and an application that improves fine-grained alignment during fine-tuning.

Core claim

By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results by applying channel and spatial weights on token features.

What carries the argument

Gradient-derived channel and spatial weights applied to token features in CLIP encoders to generate visual and textual explanation heat maps.
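
As a rough illustration of the weighting idea, the Python sketch below combines per-token features with one weight per channel and one weight per token. It is not the authors' implementation: the function name `weighted_heat_map`, the toy inputs, and the final ReLU (borrowed from Grad-CAM-style methods) are assumptions made for this example. In a real setting the channel weights would come from gradients of the similarity score and the features from a CLIP encoder; here they are hard-coded lists.

```python
def weighted_heat_map(token_features, channel_weights, spatial_weights):
    """Return one relevance score per token.

    token_features:  list of N tokens, each a list of C channel values.
    channel_weights: list of C values (assumed to be gradient-derived).
    spatial_weights: list of N values, one per token (assumed non-negative).
    """
    heat = []
    for feats, lam in zip(token_features, spatial_weights):
        # Channel-weighted sum of the token's features, scaled by its spatial weight.
        score = lam * sum(w * f for w, f in zip(channel_weights, feats))
        # Assumption: keep only positive evidence, as Grad-CAM-style methods do.
        heat.append(max(score, 0.0))
    return heat


# Toy example: 3 tokens, 2 channels (all values are made up).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
channel_w = [2.0, -1.0]
spatial_w = [0.5, 1.0, 1.0]
print(weighted_heat_map(features, channel_w, spatial_w))  # [1.0, 0.0, 1.0]
```

For an image, the resulting list would be reshaped to the patch grid and upsampled to the input resolution; for text, each score maps to one word token.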

If this is right

  • Reveals the working mechanism of image-text matching in CLIP.
  • Identifies strengths and limitations in CLIP's attribution identification.
  • Explores the relationship between a word's concreteness and its usage in CLIP.
  • Boosts fine-grained alignment in CLIP fine-tuning using text-specific saliency regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gradient weighting could improve explanations in other multimodal models.
  • The method might enable better debugging of vision-language models.
  • Explanations could be integrated into training to enforce better alignment automatically.

Load-bearing premise

That applying gradient-based channel and spatial weights to token features yields more faithful, higher-quality explanations than attention maps, which are sparse in CLIP.

What would settle it

The claim would be undermined if quantitative metrics such as insertion or deletion scores, or human agreement, showed no improvement over state-of-the-art attention-based methods on CLIP explanation tasks.
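
For readers unfamiliar with deletion scores, the sketch below shows the general idea in Python: remove input elements in order of decreasing saliency and track how fast the model score falls. A faithful explanation should make the score drop quickly, giving a smaller area under the curve. The names `deletion_curve`, `area_under_curve`, and `toy_model`, the linear toy model, and the baseline value of 0 are all hypothetical; a real evaluation would mask image patches or words and re-run CLIP, and the paper's exact protocol may differ.

```python
def deletion_curve(score_fn, inputs, saliency, baseline=0.0):
    """Model scores after removing elements one by one, most salient first."""
    order = sorted(range(len(inputs)), key=lambda i: saliency[i], reverse=True)
    current = list(inputs)
    scores = [score_fn(current)]
    for i in order:
        current[i] = baseline  # "delete" the element by setting it to the baseline
        scores.append(score_fn(current))
    return scores


def area_under_curve(scores):
    """Trapezoidal area with the x-axis rescaled to [0, 1]."""
    steps = len(scores) - 1
    if steps < 1:
        raise ValueError("need at least two points")
    return sum((scores[k] + scores[k + 1]) / 2 for k in range(steps)) / steps


def toy_model(x):
    # Hypothetical stand-in for a similarity score: a fixed linear model.
    weights = [3.0, 1.0, 0.0]
    return sum(w * v for w, v in zip(weights, x))


inputs = [1.0, 1.0, 1.0]
faithful = deletion_curve(toy_model, inputs, saliency=[3.0, 1.0, 0.0])
misleading = deletion_curve(toy_model, inputs, saliency=[0.0, 1.0, 3.0])
print(area_under_curve(faithful))    # 1.0
print(area_under_curve(misleading))  # 3.0
```

An insertion score works the other way round: start from the baseline, add elements back in saliency order, and prefer a larger area.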

Figures

Figures reproduced from arXiv: 2502.18816 by Antoni B. Chan, Chenyang Zhao, Janet H. Hsiao, Kun Wang.

Figure 1: Visual and textual explanations of CLIP for the image with the
Figure 2: Illustration of Grad-ECLIP. An image-text pair specific visual
Figure 3: Comparison of heat maps from: (a) the raw self-attention map in the last ViT layer; (b) Rollout [14]; (c) Grad-CAM [20]; (d) GAME
Figure 4: Explanations for image-text pairs from MS COCO using: by (a) raw self-attention; Transformer interpretation methods (b) Rollout, (c)
Figure 5: Comparison of visual explanations for different methods on image samples from different image domains: natural images (ImageNet),
Figure 6: Grad-ECLIP visual explanations of the ViT classifier and BLIP.
Figure 7: Visual explanations of the CNN-based CLIP with ResNet50
Figure 8: Classification accuracy at Top-1 vs. (a) Deletion steps
Figure 10: The image visual explanations generated when aggregating over N layers of the image transformer encoder. Panels: N = 1, 2, 4, 6, 8, 10, 12.
Figure 13: Visual explanation heat maps generated for single words and
Figure 14: Visual explanations on image matching with different kinds
Figure 15: The average word importance (via Grad-ECLIP) vs. word
Figure 16: The average word importance (via Grad-ECLIP) vs. the word frequency in the OpenWordText dataset for the top-1000 most frequent
Figure 17: Overview of the proposed fine-grained fine-tuning of CLIP

Abstract

Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention has been paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analyses is conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, because the explanation map indicates the text-specific salient region of the input image, we also propose an application of Grad-ECLIP that boosts fine-grained alignment in CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Grad-ECLIP, a gradient-based method for generating visual and textual explanations of CLIP image-text matching scores. It decomposes the CLIP encoder to relate similarity to intermediate spatial features, then applies channel and spatial weights derived from gradients to token features (instead of relying on sparse self-attention maps) to produce heatmaps; qualitative/quantitative evaluations claim superiority over prior methods, followed by analyses of CLIP's attribution behavior and an application that uses the maps to improve fine-grained alignment during CLIP fine-tuning. Code is released.

Significance. If the heatmaps can be shown to be faithful, the work would address an important gap in VLM interpretability by providing an alternative to attention-based explanations and could support better understanding and fine-tuning of CLIP-like models; the open-source code is a positive contribution to reproducibility.

major comments (2)
  1. [§4] §4 (Quantitative Evaluation): The reported superiority over attention-based and other explanation methods rests on unspecified quantitative metrics; standard faithfulness protocols (insertion/deletion AUC, pointing game against human annotations, or controlled synthetic pairs) are not mentioned, leaving open whether the channel/spatial weighting isolates influential regions/words or produces artifacts.
  2. [§3] §3 (Method): The central claim that gradient-derived channel and spatial weights on token features yield higher-quality explanations than sparse attention maps requires an ablation isolating the contribution of the weighting scheme versus raw gradients or attention; without it, the advantage over prior Transformer interpretation methods is not load-bearing.
minor comments (2)
  1. [Abstract, §5] Abstract and §5: The application section claims the explanation maps boost fine-grained alignment, but the quantitative improvement (e.g., retrieval or alignment scores before/after) should be reported with error bars for clarity.
  2. [§3] Notation: The decomposition relating matching similarity to intermediate features would benefit from an explicit equation showing how the final similarity score is expressed in terms of the weighted token features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments and indicate the revisions planned for the manuscript.

  1. Referee: [§4] §4 (Quantitative Evaluation): The reported superiority over attention-based and other explanation methods rests on unspecified quantitative metrics; standard faithfulness protocols (insertion/deletion AUC, pointing game against human annotations, or controlled synthetic pairs) are not mentioned, leaving open whether the channel/spatial weighting isolates influential regions/words or produces artifacts.

    Authors: We agree that the quantitative metrics should be more explicitly described and that standard faithfulness protocols would strengthen the evaluation. In the revised manuscript, we will specify the metrics used in §4 and incorporate insertion/deletion AUC as well as pointing game evaluations against human annotations to better validate the faithfulness of the explanations. revision: yes

  2. Referee: [§3] §3 (Method): The central claim that gradient-derived channel and spatial weights on token features yield higher-quality explanations than sparse attention maps requires an ablation isolating the contribution of the weighting scheme versus raw gradients or attention; without it, the advantage over prior Transformer interpretation methods is not load-bearing.

    Authors: We recognize that an ablation study would help isolate the effect of the channel and spatial weighting. Although the current manuscript includes comparisons to attention-based methods, we will add a dedicated ablation in the revised version comparing Grad-ECLIP to variants using only raw gradients and only attention maps to substantiate the contribution of the weighting scheme. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained gradient-based method

The paper's core derivation decomposes the CLIP encoder to relate similarity scores to intermediate spatial features and applies gradient-derived channel/spatial weights to token features. This is a direct methodological construction presented as an alternative to sparse attention maps, with no self-citations invoked as uniqueness theorems, no parameters fitted to data then relabeled as predictions, and no equations that reduce to their own inputs by definition. Evaluations are claimed to verify effectiveness independently of the method's construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on standard gradient computation and feature weighting; no free parameters, axioms, or invented entities are introduced beyond the existing CLIP architecture and standard attribution techniques.

Reference graph

Works this paper leans on

88 extracted references · 12 canonical work pages · 6 internal anchors

  [1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
  [2] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in CVPR, 2021, pp. 3558–3568
  [3] J. Cha, K. Lee, S. Park, and S. Chun, “Domain generalization by mutual-information regularization with pre-trained models,” in ECCV, 2022
  [4] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, “Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning,” Neurocomputing, vol. 508, pp. 293–304, 2022
  [5] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu, “Cris: Clip-driven referring image segmentation,” in CVPR, 2022
  [6] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in CVPR, 2022
  [7] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv:2205.01917, 2022
  [8] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022, pp. 12888–12900
  [9] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, vol. 130, no. 9, pp. 2337–2348, 2022
  [10] G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang, “Prompt learning with optimal transport for vision-language models,” arXiv:2210.01253, 2022
  [11] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in ECCV, 2020
  [12] J. Wang, P. Zhou, M. Z. Shou, and S. Yan, “Position-guided text prompt for vision-language pre-training,” in CVPR, 2023, pp. 23242–23251
  [13] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in CVPR, 2022, pp. 16793–16803
  [14] S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” in ACL, 2020, pp. 4190–4197
  [15] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” in CVPR, 2021
  [16] H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” in ICCV, 2021, pp. 397–406
  [17] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10(7):e0130140, 2015
  [18] Y. Li, H. Wang, Y. Duan, and X. Li, “Clip surgery for better explainability with enhancement in open-vocabulary tasks,” arXiv:2304.05653, 2023
  [19] C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in ECCV. Springer, 2022, pp. 696–712
  [20] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626
  [21] V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,” arXiv:1806.07421, 2018
  [22] Y. Wang, T. G. Rudner, and A. G. Wilson, “Visual explanations of image-text representations via multi-modal information bottleneck attribution,” NeurIPS, 2024
  [23] Y. Li, H. Wang, Y. Duan, H. Xu, and X. Li, “Exploring visual interpretability for contrastive language-image pre-training,” arXiv:2209.07046, 2022
  [24] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833
  [25] P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y. Wei, “Layer-cam: Exploring hierarchical class activation maps for localization,” TIP, vol. 30, pp. 5875–5888, 2021
  [26] S. Srinivas and F. Fleuret, “Full-gradient representation for neural network visualization,” in NeurIPS, vol. 32, 2019
  [27] D. Kim, A. Angelova, and W. Kuo, “Region-aware pretraining for open-vocabulary object detection with vision transformers,” in CVPR, 2023
  [28] S. Wu, W. Zhang, L. Xu, S. Jin, X. Li, W. Liu, and C. C. Loy, “Clipself: Vision transformer distills itself for open-vocabulary dense prediction,” arXiv preprint arXiv:2310.01403, 2023
  [29] C. Zhao, K. Wang, X. Zeng, R. Zhao, and A. B. Chan, “Gradient-based visual explanation for transformer-based clip,” in ICML, 2024
  [30] Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, and J. Shao, “Camp: Cross-modal adaptive message passing for text-image retrieval,” in ICCV, 2019
  [31] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 2048–2057
  [32] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015
  [33] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in ICCV, 2015
  [34] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in WACV, 2018, pp. 839–847
  [35] H. G. Ramaswamy et al., “Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization,” in WACV, 2020
  [36] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in CVPR Workshops, 2020, pp. 24–25
  [37] H. Wang, R. Naidu, J. Michael, and S. S. Kundu, “Ss-cam: Smoothed score-cam for sharper visual feature localization,” arXiv:2006.14255, 2020
  [38] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’ Explaining the predictions of any classifier,” in ACM SIGKDD, 2016
  [39] R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in ICCV, 2017, pp. 3429–3437
  [40] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” NeurIPS, vol. 30, 2017
  [41] J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiedemer, and S. Behnke, “Interpretable and fine-grained visual explanations for convolutional neural networks,” in CVPR, 2019, pp. 9097–9107
  [42] J. Lee, J. Yi, C. Shin, and S. Yoon, “Bbam: Bounding box attribution map for weakly supervised semantic and instance segmentation,” in CVPR, 2021
  [43] V. Petsiuk, R. Jain, V. Manjunatha, V. I. Morariu, A. Mehra, V. Ordonez, and K. Saenko, “Black-box explanation of object detectors via saliency maps,” in CVPR, 2021, pp. 11443–11452
  [44] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep taylor decomposition,” Pattern recognition, vol. 65, pp. 211–222, 2017
  [45] W.-J. Nam, S. Gur, J. Choi, L. Wolf, and S.-W. Lee, “Relative attributing propagation: Interpreting the comparative contributions of individual units in deep neural networks,” in AAAI, 2020
  [46] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in ICML, 2017
  [47] J. Gu, Y. Yang, and V. Tresp, “Understanding individual decisions of cnns via contrastive backpropagation,” in ACCV, 2019, pp. 119–134
  [48] B. K. Iwana, R. Kuroki, and S. Uchida, “Explaining convolutional neural networks using softmax gradient layer-wise relevance propagation,” in ICCVW, 2019
  [49] Y. Qiang, D. Pan, C. Li, X. Li, R. Jang, and D. Zhu, “Attcat: Explaining transformers via attentive class activation tokens,” NeurIPS, 2022
  [50] W. Xie, X.-H. Li, C. C. Cao, and N. L. Zhang, “Vit-cx: Causal explanation of vision transformers,” pp. 1569–1577, 2023
  [51] L. Yu and W. Xiang, “X-pruner: explainable pruning for vision transformers,” in CVPR, 2023, pp. 24355–24363
  [52] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai, “A simple baseline for zeroshot semantic segmentation with pre-trained vision-language model,” ECCV, 2022
  [53] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in CVPR, 2023, pp. 7061–7070
  [54] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” ICLR, 2022
  [55] Y. Du, F. Wei, Z. Zhang, M. Shi, Y. Gao, and G. Li, “Learning to prompt for open-vocabulary object detection with vision-language model,” in CVPR, 2022, pp. 14084–14093
  [56] W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova, “F-vlm: Open-vocabulary object detection upon frozen vision and language models,” in ICLR, 2023
  [57] X. Wu, F. Zhu, R. Zhao, and H. Li, “Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching,” in CVPR, 2023, pp. 7031–7040
  [58] Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” NeurIPS, vol. 36, 2024
  [59] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in CVPR, 2022, pp. 10965–10975
  [60] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in ECCV, 2020
  [61] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, “Vinvl: Revisiting visual representations in vision-language models,” in CVPR, 2021, pp. 5579–5588
  [62] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023
  [63] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” IJCV, vol. 123, pp. 32–73, 2017
  [64] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” NeurIPS, vol. 28, 2015
  [65] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778
  [66] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, pp. 211–252, 2015
  [67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014
  [68] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in ICCV, 2021, pp. 8340–8349
  [69] H. Wang, S. Ge, Z. Lipton, and E. P. Xing, “Learning robust global representations by penalizing local predictive power,” NeurIPS, 2019
  [70] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in CVPR, 2021, pp. 15262–15271
  [71] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in ACL, 2018, pp. 2556–2565
  [72] B. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle et al., “Making the most of text semantics to improve biomedical vision–language processing,” in ECCV, 2022
  [73] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020
  [74] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” NeurIPS, vol. 34, pp. 9694–9705, 2021
  [75] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, “Evaluating the visualization of what a deep neural network has learned,” IEEE Trans. Neural Netw Learn Syst, vol. 28(11), pp. 2660–73, 2016
  [76] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” IJCV, vol. 126(10), pp. 1084–102, 2018
  [77] C. Zhao and A. B. Chan, “Odam: Gradient-based instance-specific visual explanation for object detection,” in ICLR, 2022
  [78] S. Gao, Z.-Y. Li, M.-H. Yang, M.-M. Cheng, J. Han, and P. Torr, “Large-scale unsupervised semantic segmentation,” TPAMI, 2022
  [79] J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in CVPR, 2017
  [80] M. Brysbaert, A. B. Warriner, and V. Kuperman, “Concreteness ratings for 40 thousand generally known english word lemmas,” Behavior research methods, vol. 46, pp. 904–911, 2014

Showing first 80 references.