Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Pith reviewed 2026-05-23 02:18 UTC · model grok-4.3
The pith
Grad-ECLIP generates heat maps for CLIP by weighting token features with gradient-derived channel and spatial weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results by applying channel and spatial weights on token features.
What carries the argument
Gradient-derived channel and spatial weights applied to token features in CLIP encoders to generate visual and textual explanation heat maps.
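As a rough illustration of the mechanism described above, the following Python sketch combines per-channel weights and per-token spatial weights with token features to form a heat map. It is not the authors' implementation. The function name `weighted_token_heatmap`, the array shapes, the ReLU, and the min-max normalization are assumptions made for this example. In practice the channel weights would come from gradients of the CLIP similarity score rather than being passed in by hand.

```python
import numpy as np

def weighted_token_heatmap(token_feats, channel_w, spatial_w, grid_hw):
    """Illustrative heat map from weighted token features (hypothetical helper).

    token_feats: (N, C) features of the N spatial tokens.
    channel_w:   (C,)   per-channel importance, e.g. a gradient of the score.
    spatial_w:   (N,)   per-token importance.
    grid_hw:     (H, W) token grid with H * W == N.
    """
    h, w = grid_hw
    assert token_feats.shape[0] == h * w, "token count must match the grid"
    # Channel weighting: collapse the C channels of each token to one value.
    per_token = (token_feats * channel_w).sum(axis=1)
    # Spatial weighting, then keep only positive evidence (assumed ReLU).
    heat = np.maximum(per_token * spatial_w, 0.0)
    # Min-max normalize to [0, 1] when the map is not constant.
    span = heat.max() - heat.min()
    if span > 0:
        heat = (heat - heat.min()) / span
    return heat.reshape(h, w)

# Toy usage: only token 2 carries features aligned with the channel weights.
feats = np.zeros((4, 3))
feats[2] = 1.0
demo_heat = weighted_token_heatmap(feats, np.ones(3), np.ones(4), (2, 2))
```

In this toy case the only nonzero entry of `demo_heat` sits at grid position (1, 0), the location of token 2.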
If this is right
- Reveals the working mechanism of image-text matching in CLIP.
- Identifies strengths and limitations in CLIP's attribution identification.
- Explores the relationship between a word's concreteness and its usage in CLIP.
- Boosts fine-grained alignment in CLIP fine-tuning using text-specific saliency regions.
Where Pith is reading between the lines
- Similar gradient weighting could improve explanations in other multimodal models.
- The method might enable better debugging of vision-language models.
- Explanations could be integrated into training to enforce better alignment automatically.
Load-bearing premise
That applying gradient-based channel and spatial weights to token features yields more faithful, higher-quality explanations than self-attention maps, which are typically extremely sparse in CLIP.
What would settle it
The claim would be undermined if quantitative metrics such as insertion or deletion scores, or human-agreement measures, showed no improvement over state-of-the-art attention-based methods on CLIP explanation tasks.
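A deletion score of the kind mentioned above could be computed roughly as follows. This is a minimal sketch under stated assumptions, not the paper's evaluation protocol: `score_fn` is a hypothetical stand-in for the CLIP image-text similarity, the image is a single-channel array with the same shape as the heat map, and removed pixels are replaced by a constant baseline value.

```python
import numpy as np

def deletion_auc(score_fn, image, heatmap, steps=10, baseline=0.0):
    """Area under the deletion curve (illustrative, hypothetical helper).

    Pixels are removed in order of decreasing heat map value. A faithful map
    should make the score drop quickly, giving a small area.
    """
    assert image.shape == heatmap.shape, "sketch assumes matching 2-D shapes"
    order = np.argsort(heatmap.ravel())[::-1]      # most salient pixels first
    n = order.size
    perturbed = np.array(image, dtype=float)       # fresh contiguous copy
    flat = perturbed.reshape(-1)                   # view onto `perturbed`
    fractions = np.linspace(0.0, 1.0, steps + 1)
    scores = []
    removed = 0
    for frac in fractions:
        upto = int(round(frac * n))
        flat[order[removed:upto]] = baseline       # delete the next chunk
        removed = upto
        scores.append(float(score_fn(perturbed)))
    scores = np.asarray(scores)
    # Trapezoidal rule written out explicitly.
    return float(((scores[:-1] + scores[1:]) / 2 * np.diff(fractions)).sum())

# Toy usage: the "model" only looks at the top-left 2x2 block.
toy_image = np.ones((4, 4))
toy_score = lambda x: x[:2, :2].sum() / 4.0
good_map = np.zeros((4, 4)); good_map[:2, :2] = 1.0   # points at the right block
bad_map = np.zeros((4, 4)); bad_map[2:, 2:] = 1.0     # points at the wrong block
auc_good = deletion_auc(toy_score, toy_image, good_map, steps=4)
auc_bad = deletion_auc(toy_score, toy_image, bad_map, steps=4)
```

Here `auc_good` is smaller than `auc_bad`, because the first map removes the pixels the toy score depends on before anything else. An insertion score would follow the same pattern in reverse, starting from the baseline image and restoring pixels.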
Abstract
Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention has been paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with state-of-the-art methods. Furthermore, a series of analyses is conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of the explanation map to indicate the text-specific salient region of an input image, we also propose an application of Grad-ECLIP that boosts fine-grained alignment in CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Grad-ECLIP, a gradient-based method for generating visual and textual explanations of CLIP image-text matching scores. It decomposes the CLIP encoder to relate the similarity score to intermediate spatial features, then applies gradient-derived channel and spatial weights to token features, rather than relying on sparse self-attention maps, to produce heat maps. Qualitative and quantitative evaluations claim superiority over prior methods. The paper then analyzes CLIP's attribution behavior and presents an application that uses the maps to improve fine-grained alignment during CLIP fine-tuning. Code is released.
Significance. If the heat maps can be shown to be faithful, the work would address an important gap in VLM interpretability by providing an alternative to attention-based explanations, and it could support better understanding and fine-tuning of CLIP-like models. The open-source code is a positive contribution to reproducibility.
major comments (2)
- [§4] §4 (Quantitative Evaluation): The reported superiority over attention-based and other explanation methods rests on unspecified quantitative metrics; standard faithfulness protocols (insertion/deletion AUC, pointing game against human annotations, or controlled synthetic pairs) are not mentioned, leaving open whether the channel/spatial weighting isolates influential regions/words or produces artifacts.
- [§3] §3 (Method): The central claim that gradient-derived channel and spatial weights on token features yield higher-quality explanations than sparse attention maps requires an ablation isolating the contribution of the weighting scheme versus raw gradients or attention; without it, the advantage over prior Transformer interpretation methods is not established.
minor comments (2)
- [Abstract, §5] Abstract and §5: The application section claims the explanation maps boost fine-grained alignment, but the quantitative improvement (e.g., retrieval or alignment scores before/after) should be reported with error bars for clarity.
- [§3] Notation: The decomposition relating matching similarity to intermediate features would benefit from an explicit equation showing how the final similarity score is expressed in terms of the weighted token features.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments and indicate the revisions planned for the manuscript.
- Referee: [§4] §4 (Quantitative Evaluation): The reported superiority over attention-based and other explanation methods rests on unspecified quantitative metrics; standard faithfulness protocols (insertion/deletion AUC, pointing game against human annotations, or controlled synthetic pairs) are not mentioned, leaving open whether the channel/spatial weighting isolates influential regions/words or produces artifacts.
  Authors: We agree that the quantitative metrics should be more explicitly described and that standard faithfulness protocols would strengthen the evaluation. In the revised manuscript, we will specify the metrics used in §4 and incorporate insertion/deletion AUC as well as pointing game evaluations against human annotations to better validate the faithfulness of the explanations.
  Revision: yes
- Referee: [§3] §3 (Method): The central claim that gradient-derived channel and spatial weights on token features yield higher-quality explanations than sparse attention maps requires an ablation isolating the contribution of the weighting scheme versus raw gradients or attention; without it, the advantage over prior Transformer interpretation methods is not established.
  Authors: We recognize that an ablation study would help isolate the effect of the channel and spatial weighting. Although the current manuscript includes comparisons to attention-based methods, we will add a dedicated ablation in the revised version comparing Grad-ECLIP to variants using only raw gradients and only attention maps to substantiate the contribution of the weighting scheme.
  Revision: yes
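The pointing game named in the exchange above can be sketched in a few lines of Python. This is an illustrative version under assumptions, not the protocol the authors commit to: the helper names are hypothetical, each ground-truth annotation is taken to be a single inclusive box `(x0, y0, x1, y1)` in heat map pixel coordinates, and a hit means the single highest-valued pixel falls inside that box.

```python
import numpy as np

def pointing_game_hit(heatmap, box):
    """Return True if the heat map's maximum lies inside the box (inclusive)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)

def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of samples whose maximum point hits the annotated box."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes)]
    return sum(hits) / len(hits) if hits else 0.0

# Toy usage: the peak is at row 1, column 2.
toy_heat = np.zeros((4, 4))
toy_heat[1, 2] = 1.0
hit_box = (2, 1, 3, 2)    # contains x=2, y=1
miss_box = (0, 0, 1, 1)   # does not contain x=2
toy_accuracy = pointing_game_accuracy([toy_heat, toy_heat], [hit_box, miss_box])
```

Published variants of the pointing game often allow a small pixel tolerance around the maximum or use segmentation masks instead of boxes. Those choices are left out here to keep the sketch short.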
Circularity Check
No circularity: derivation is self-contained gradient-based method
The paper's core derivation decomposes the CLIP encoder to relate similarity scores to intermediate spatial features and applies gradient-derived channel/spatial weights to token features. This is a direct methodological construction presented as an alternative to sparse attention maps, with no self-citations invoked as uniqueness theorems, no parameters fitted to data then relabeled as predictions, and no equations that reduce to their own inputs by definition. Evaluations are claimed to verify effectiveness independently of the method's construction.