pith. sign in

arxiv: 2512.07245 · v2 · pith:EAPVQVBFnew · submitted 2025-12-08 · 💻 cs.CV

Zero-Shot Textual Explanations via Translating Decision-Critical Features

Pith reviewed 2026-05-21 18:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot textual explanationsdecision-critical featuresimage classifier interpretabilityCLIP alignmentsparse autoencoderneuron contribution analysisfaithful explanationstransformer interpretability
0
0 comments X

The pith

TEXTER generates textual explanations for image classifier decisions by isolating decision-critical features before aligning them with language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that zero-shot textual explanations can better reflect a classifier's actual reasoning if decision-critical features are isolated first rather than aligning global image features with language. TEXTER does this by identifying the neurons that contribute to a prediction, emphasizing the features those neurons encode, and then mapping only those emphasized features into CLIP space to retrieve relevant text. A sparse autoencoder is added to improve disentanglement and interpretability, especially for transformer architectures. This matters to a sympathetic reader because existing methods tend to describe visible content in the image instead of the specific elements driving the model's output. If the approach holds, explanations become more faithful to the internal decision process without requiring task-specific training data.

Core claim

TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER provides more faithful and interpretable explanations than existing methods.

What carries the argument

Neuron-based isolation of decision-critical features before CLIP alignment, with sparse autoencoding for transformers.

If this is right

  • Textual explanations will describe the specific elements that drove the classifier's output rather than general image content.
  • Explanation faithfulness will increase because only prediction-relevant features are aligned with language.
  • Sparse autoencoders will enhance human interpretability of the resulting descriptions, especially on transformer backbones.
  • Zero-shot explanations become available for any pretrained classifier without additional labeled explanation data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The neuron-isolation step could be adapted to other model types, such as those processing text or audio, by identifying analogous critical units.
  • Selective feature emphasis before cross-modal mapping may prove useful for improving alignment quality in other vision-language tasks.
  • One could test whether the same isolation principle improves post-hoc explanations for regression or detection models beyond simple classification.
  • If the method scales, it suggests that interpretability gains can come from post-processing feature selection rather than model retraining.

Load-bearing premise

The neurons identified as contributing to the prediction encode precisely the decision-critical features whose mapping into CLIP space will faithfully reflect the classifier's internal reasoning.

What would settle it

If faithfulness metrics show that TEXTER explanations do not align better with the classifier's output changes under feature ablation than global-feature baselines, the advantage of isolating decision-critical features would not hold.

Figures

Figures reproduced from arXiv: 2512.07245 by Hiroshi Kera, Kazuhiko Kawamoto, Toshinori Yamauchi.

Figure 1
Figure 1. Figure 1: Comparison between Text-To-Concept [29] and the pro￾posed TEXTER for explaining a cat prediction. Text-To-Concept, which relies on global image features, produces the explanation “cushions,” describing dominant but irrelevant regions. In con￾trast, TEXTER isolates decision-critical features, such as whisker spots, through a concept image and translates it into the explana￾tion “whisker spots,” faithfully r… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed TEXTER. The left part illu [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of generated explanations between Tex [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of the textual explanatio [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of the generated explanations between [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER provides more faithful and interpretable explanations than existing methods. The code is available at \url{https://github.com/tttt-0814/TEXTER}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes TEXTER, a zero-shot method for textual explanations of image classifier decisions. It first identifies neurons contributing to the prediction, emphasizes the features encoded by those neurons (termed decision-critical features), maps the resulting vector into CLIP space, and retrieves the nearest text descriptions. A sparse autoencoder is used to improve interpretability for Transformer architectures. The authors report that extensive experiments demonstrate superior faithfulness and interpretability relative to prior zero-shot alignment methods.

Significance. If the faithfulness claims are substantiated, the work would offer a practical route to classifier-specific explanations that avoid the generic visual descriptions produced by global feature alignment. The open release of code supports reproducibility and could facilitate follow-up studies in explainable computer vision.

major comments (1)
  1. [Method (central construction)] The faithfulness claim rests on the assumption that CLIP-space nearest-neighbor retrieval from neuron-emphasized features recovers language that describes the classifier's actual decision boundary. No controlled ablation is reported that isolates the projection step by comparing retrieval from raw image features versus neuron-emphasized features while holding the input image fixed; without this comparison the contribution of the decision-critical feature isolation cannot be separated from CLIP's general visual-textual correlations.
minor comments (1)
  1. [Abstract] The abstract states that 'extensive experiments' demonstrate superiority but does not name the datasets, quantitative metrics, or baseline methods; adding these details would clarify the evaluation scope.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback highlights an important aspect of validating the core contribution of decision-critical feature isolation. We address the major comment below and will revise the manuscript to incorporate the suggested analysis.

read point-by-point responses
  1. Referee: [Method (central construction)] The faithfulness claim rests on the assumption that CLIP-space nearest-neighbor retrieval from neuron-emphasized features recovers language that describes the classifier's actual decision boundary. No controlled ablation is reported that isolates the projection step by comparing retrieval from raw image features versus neuron-emphasized features while holding the input image fixed; without this comparison the contribution of the decision-critical feature isolation cannot be separated from CLIP's general visual-textual correlations.

    Authors: We agree that a controlled ablation holding the input image fixed would more cleanly isolate the benefit of the neuron-emphasis step. While our current experiments already compare TEXTER against global-alignment baselines on the same images and report improved faithfulness metrics, they do not include the exact raw-versus-emphasized retrieval comparison on identical inputs. In the revised manuscript we will add this ablation: for a fixed set of images we will retrieve nearest-neighbor text using (i) raw CLIP image features and (ii) our neuron-emphasized features, then quantify the difference in explanation faithfulness and human interpretability. The new results will be presented in a dedicated table and discussed in the experimental section. revision: yes

Circularity Check

0 steps flagged

No circularity: method assembles external components without self-referential reduction

full rationale

The paper's core construction identifies neurons contributing to a classifier prediction, emphasizes the encoded features, projects the result into CLIP space, and retrieves nearest-neighbor text. This sequence is presented as a procedural pipeline using standard external tools (CLIP, sparse autoencoders) rather than any derivation that equates an output quantity to a fitted input or prior self-citation by construction. No equations are supplied that would make the faithfulness claim tautological, and the experimental comparisons are treated as independent validation. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract. The method builds on existing neural-network concepts and pre-trained models without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5687 in / 1039 out tokens · 59980 ms · 2026-05-21T18:03:12.094185+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam et al. Gpt-4 technical report. arXiv: 2303.08774, 2023. 1, 2

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 5

  3. [3]

    Network Dissection: Quantifying Inter- pretability of Deep Visual Representations

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network Dissection: Quantifying Inter- pretability of Deep Visual Representations . In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3327, 2017. 2

  4. [4]

    P., and Lakkaraju, H

    Usha Bhalla, Alex Oesterling, Suraj Srinivas, Fl´ avio P . Calmon, and Himabindu Lakkaraju. Interpreting clip with sparse linear concept embeddings (splice). CoRR, abs/2402.10376, 2024. 4

  5. [5]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´ e J´ egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Pro- ceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. 5

  6. [6]

    Devil: Decoding vision features into language

    Meghal Dani, Isabel Rio-Torto, Stephan Alaniz, and Zeyn ep Akata. Devil: Decoding vision features into language. CoRR, 2023. 2

  7. [7]

    Imagenet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 5

  8. [8]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov , Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In The International Conference on Learning Repre- sentations (ICLR), 2021. 5

  9. [9]

    Mark Everingham, Luc V an Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pas- cal visual object classes (voc) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010. 6

  10. [10]

    Unlocking feature visualiza- tion for deep network with MAgnitude constrained optimiza- tion

    Thomas FEL, Thibaut Boissin, Victor Boutin, Agustin Ma r- tin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom ROUSSEAU, Remi Cadene, Lore Goetschalckx, Laurent Gardes, and Thomas Serre. Unlocking feature visualiza- tion for deep network with MAgnitude constrained optimiza- tion. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 2, 3, 4

  11. [11]

    A holistic approach to unifying automatic concept extraction and concept importance estimation

    Thomas FEL, Victor Boutin, Louis B´ ethune, Remi Ca- dene, Mazda Moayeri, L´ eo And´ eol, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation. In Thirty-seventh Conference on Neural Information Process- ing Systems, 2023. 4

  12. [12]

    Craft: Concept recursive activation factori za- tion for explainability, 2023

    Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, R´ emi Cad` ene, and Thomas Serre. Craft: Concept recursive activation factori za- tion for explainability, 2023. 2

  13. [13]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The International Conference on Learning Representations (ICLR), 2025. 2, 4, 1

  14. [14]

    Vital: More understandable feature visualization through distri bu- tion alignment and relevant information flow, 2025

    Ada Gorgun, Bernt Schiele, and Jonas Fischer. Vital: More understandable feature visualization through distri bu- tion alignment and relevant information flow, 2025. 4

  15. [15]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. 5

  16. [16]

    Generating vi- sual explanations

    Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Je ff Donahue, Bernt Schiele, and Trevor Darrell. Generating vi- sual explanations. In Proceedings of the European Confer- ence on Computer Vision (ECCV) , pages 3–19, 2016. 2

  17. [17]

    Natural language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In The In- ternational Conference on Learning Representations (ICLR), 2022. 4

  18. [18]

    CLIPScore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bra s, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Pro- cessing, pages 7514–7528, 2021. 6

  19. [19]

    Iandola, Song Han, Matthew W

    Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parame- ters and ¡0.5mb model size, 2016. 7

  20. [20]

    Comparing th e decision-making mechanisms by transformers and cnns via explanation methods

    Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing th e decision-making mechanisms by transformers and cnns via explanation methods. In Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 9546–9555, 2024. 4, 6, 8

  21. [21]

    C ai, James Wexler, Fernanda B

    Been Kim, Martin Wattenberg, Justin Gilmer, Carrie J. C ai, James Wexler, Fernanda B. Vi´ egas, and Rory Sayres. In- terpretability beyond feature attribution: Quantitative test- ing with concept activation vectors (tcav). In Proceedings of the International Conference on Machine Learning (ICML) , pages 2673–2682, 2018. 2

  22. [22]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Proceedings of the Interna- tional Conference on Machine Learning (ICML) , 2020. 1, 2

  23. [23]

    Imagenet classification with deep convolutional neural net - works

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton . Imagenet classification with deep convolutional neural net - works. In Advances in Neural Information Processing Sys- tems (NeurIPS), 2012. 7 9

  24. [24]

    BLIP: Bootstrapping language-image pre-training for unifi ed vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unifi ed vision-language understanding and generation. In Proceed- ings of the 39th International Conference on Machine Learn- ing, pages 12888–12900, 2022. 1, 2

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Y ong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, pages 34892–34916, 2023. 1, 2

  26. [26]

    Hybrid concept bot- tleneck models

    Yang Liu, Tianwei Zhang, and Shi Gu. Hybrid concept bot- tleneck models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 20179–20189, 2025. 1, 2

  27. [27]

    Lundberg and Su-In Lee

    Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural In- formation Processing Systems (NeurIPS) , page 4768–4777,

  28. [28]

    Visual classification v ia description from large language models, 2023

    Sachit Menon and Carl V ondrick. Visual classification v ia description from large language models, 2023. 2, 5

  29. [29]

    Text-to-concept (and back) via cross-model alignment

    Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Sohe il Feizi. Text-to-concept (and back) via cross-model alignment. In Proceedings of the International Conference on Machine Learning (ICML), 2023. 1, 2, 3, 4, 5, 7

  30. [30]

    Synthesizing the preferred inputs for neurons in neural networks via deep generator net- works

    Anh Nguyen, Alexey Dosovitskiy, Jason Y osinski, Thoma s Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator net- works. In Advances in Neural Information Processing Sys- tems (NeurIPS), page 3395–3403, 2016. 4

  31. [31]

    Bengio, Alexey Dosovitskiy,and Jason Y osinski

    Anh Nguyen, Jeff Clune, Y . Bengio, Alexey Dosovitskiy,and Jason Y osinski. Plug & play generative networks: Condi- tional iterative generation of images in latent space. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3510–3520, 2017. 4

  32. [32]

    Nguyen, and Tsui- Wei Weng

    Tuomas Oikarinen, Subhro Das, Lam M. Nguyen, and Tsui- Wei Weng. Label-free concept bottleneck models. In The In- ternational Conference on Learning Representations (ICLR),

  33. [33]

    Oikarinen and Tsui-Wei Weng

    Tuomas P . Oikarinen and Tsui-Wei Weng. Clip-dissect: A u- tomatic description of neuron representations in deep visi on networks. In The International Conference on Learning Rep- resentations (ICLR), 2023. 4

  34. [34]

    Feature visualization

    Christopher Olah, Ludwig Schubert, and Alexander Mord v- intsev. Feature visualization. Distill, 2017. 4

  35. [35]

    Gpt-3.5 turbo models

    OpenAI. Gpt-3.5 turbo models. https://platform.openai.com/docs/models/gpt-3-5 ,

  36. [36]

    Accessed 2025-10-13. 5

  37. [37]

    Multimodal explanations: Justifying deci- sions and pointing to the evidence

    Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Mar- cus Rohrbach. Multimodal explanations: Justifying deci- sions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2018. 2

  38. [38]

    Language models are unsuper- vised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dari o Amodei, and Ilya Sutskever. Language models are unsuper- vised multitask learners. OpenAI, 2019. 2

  39. [39]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML) , pages 8748–8763, 2021. 1, 2, 4

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML) , pages 8748–8763, 2021. 4

  41. [41]

    Do vision trans- formers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS) , pages 12116–12128, 2021

    Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision trans- formers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS) , pages 12116–12128, 2021. 4, 6, 8

  42. [42]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨ orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 7

  43. [43]

    Sophia Koepke, Hendrik P

    Leonard Salewski, A. Sophia Koepke, Hendrik P . A. Lensc h, and Zeynep Akata. Zero-shot translation of attention patterns in vqa models to natural language. In Pattern Recognition, pages 378–393, Cham, 2024. 2

  44. [44]

    Uni-nlx: Unify- ing textual explanations for vision and vision-language tasks

    Fawaz Sammani and Nikos Deligiannis. Uni-nlx: Unify- ing textual explanations for vision and vision-language tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) W orkshops , pages 4634–4639,

  45. [45]

    Zero-shot natura l language explanations

    Fawaz Sammani and Nikos Deligiannis. Zero-shot natura l language explanations. In The International Conference on Learning Representations (ICLR), 2025. 1, 2, 3, 6

  46. [46]

    Nlx-gpt: A model for natural language explanations in vision and vision-language tasks

    Fawaz Sammani, Tanmoy Mukherjee, and Nikos Deligian- nis. Nlx-gpt: A model for natural language explanations in vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8322–8332, 2022. 1, 2

  47. [47]

    Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna V edantam, Devi Parikh, and Dhruv Ba- tra

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna V edantam, Devi Parikh, and Dhruv Ba- tra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE In- ternational Conference on Computer Vision (ICCV) , pages 618–626, 2017. 5

  48. [48]

    Incremental residual con- cept bottleneck models

    Chenming Shang, Shiji Zhou, Hengyuan Zhang, Xinzhe Ni, Y ujiu Yang, and Y uwang Wang. Incremental residual con- cept bottleneck models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11030–11040, 2024. 2

  49. [49]

    Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. In Pro- ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 2556–2565, 2018. 2

  50. [50]

    Rupprecht, and Andrea V eda ldi

    Aleksandar Shtedritski, C. Rupprecht, and Andrea V eda ldi. What does clip know about a red circle? visual prompt engi- neering for vlms, 2023. 2 10

  51. [51]

    Axiomat ic attribution for deep networks

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomat ic attribution for deep networks. In Proceedings of the In- ternational Conference on Machine Learning (ICML) , page 3319–3328, 2017. 1, 4

  52. [52]

    Qwen2.5-vl, 2025

    Qwen Team. Qwen2.5-vl, 2025. 5

  53. [53]

    Derpanis

    Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matt hew Kowal, and Konstantinos G. Derpanis. Universal sparse autoencoders: Interpretable cross-model concept alignme nt. In Proceedings of the International Conference on Machine Learning (ICML), 2025. 4, 6

  54. [54]

    Griffiths

    Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L . Griffiths. Are convolutional neural networks or transforme rs more like human vision? ArXiv, abs/2105.07197, 2021. 4, 6, 8

  55. [55]

    Learning bottleneck concepts in image classifica- tion

    Bowen Wang, Liangzhi Li, Y uta Nakashima, and Hajime Na- gahara. Learning bottleneck concepts in image classifica- tion. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 10962– 10971, 2023. 2

  56. [56]

    Score-cam: Score-weighted visual explanations for convolutional neu ral networks

    Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neu ral networks. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR) W orkshops , pages 111–119, 2020. 5

  57. [57]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan , Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 5

  58. [58]

    Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

    Y ue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19187–19197, 2023. 2, 5

  59. [59]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shech t- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018. 7

  60. [60]

    Ehinger, and Benjamin I

    Ruihan Zhang, Prashan Madumal, Tim Miller, Krista A. Ehinger, and Benjamin I. P . Rubinstein. Invertible concept - based explanations for CNN models with non-negative con- cept activation vectors. In Proceedings of the AAAI Confer- ence on Artificial Intelligence , 2021. 2 11 Zero-Shot Textual Explanations via Translating Decision-Critical Features Supplem...

  61. [61]

    on the same dataset used to train each classifier (e.g., ImageNet) with batch size 1024, learning rate 5 × 10− 4, and the Adam optimizer for 10 epochs. A.2. Details of concept bank construction We use an LLM and a VLM to generate the concept bank B(x, c ). Below, we describe the prompts used for each model. The LLM is utilized to generate concepts that are...

  62. [62]

    Generate GENERAL concepts that can apply to many different photos of the same object type

  63. [65]

    DO NOT include class names or object names directly. Q: What are useful visual features for distinguishing a lemur in a photo? A: There are several useful visual features to tell there is a lemur in a photo: - long tail - large eyes - gray fur - trees - branches - forest Q: What are useful features for distinguishing a {class_name} in a photo? Already gen...

  64. [66]

    Generate DETAILED and SPECIFIC concepts that can apply to this image

  65. [67]

    Include both OBJECT features (e.g., shape, color, parts) AND CONTEXT features (e.g., background, environment, setting)

  66. [68]

    Keep concepts short and specific (1-3 words)

  67. [69]

    a photo of {class name} show- ing T ,

    DO NOT include class names or object names directly. Examples: Q: Look at this image carefully. Based on what you can actually see in the image, identify useful visual features that help distinguish this as a koi fish. A: There are several useful visual features to tell there is a koi fish in a photo: - bright orange scales - curved tail fin - spotted pat...

  68. [70]

    bright orange coloration,

    As expected, Text-To-Concept achieves the best performance across all models, which is consistent with its design: it aligns global image features with text and is intended to describe the input image itself. These results therefore complement Tabs. 2 and 3: while Text-To-Concept is better aligned with the input images, the proposed method provides explan...