DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification
Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3
The pith
DINO-QPM turns frozen DINOv2 patch embeddings into human-interpretable class-independent features while beating linear-probe accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DINO-QPM adapts the Quadratic Programming Enhanced Model as a lightweight interpretability adapter for strictly frozen DINO backbones. By using average-pooling of patch embeddings instead of the CLS token and imposing a sparsity loss that minimizes spatial scatter, it converts entangled high-dimensional features into contrastive, class-independent representations that support human-plausible global explanations and direct spatial localization in the input image.
What carries the argument
QPM adapter applied to average-pooled patch embeddings from a frozen DINO backbone, regularized by a sparsity loss that reduces background noise.
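The pooling step can be illustrated with a minimal sketch. It assumes a token layout in which index 0 is the CLS token and the remaining entries are patch tokens. The function names and toy numbers are hypothetical, and the learned QPM adapter layers are omitted. The sketch only shows why a mean-pooled feature decomposes into per-patch shares, which is what makes spatial localisation possible.

```python
from statistics import fmean

def split_tokens(tokens):
    """Assumed layout: index 0 is the CLS token, the rest are patch tokens."""
    return tokens[0], tokens[1:]

def average_pool(patch_tokens):
    """Mean over patches: N vectors of length D -> one vector of length D."""
    dim = len(patch_tokens[0])
    return [fmean(tok[d] for tok in patch_tokens) for d in range(dim)]

def patch_contributions(patch_tokens, d):
    """Per-patch share of pooled dimension d; the shares sum to the pooled value."""
    n = len(patch_tokens)
    return [tok[d] / n for tok in patch_tokens]

# Toy output of a frozen backbone (made-up numbers): 1 CLS token plus a 2x2 patch grid.
tokens = [
    [9.0, 9.0, 9.0],  # CLS token, not used by the pooled pathway
    [1.0, 0.0, 2.0],  # patch (0, 0)
    [3.0, 0.0, 2.0],  # patch (0, 1)
    [1.0, 4.0, 2.0],  # patch (1, 0)
    [3.0, 4.0, 2.0],  # patch (1, 1)
]
cls_token, patch_tokens = split_tokens(tokens)
pooled = average_pool(patch_tokens)
shares = patch_contributions(patch_tokens, 1)
print(pooled)  # [2.0, 2.0, 2.0]
print(shares)  # [0.0, 0.0, 1.0, 1.0]
```

In this toy case, dimension 1 of the pooled vector is carried entirely by the bottom row of patches. A CLS-based feature offers no such direct attribution.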
If this is right
- Classification accuracy exceeds that of a standard DINOv2 linear probe.
- Explanations become globally consistent and spatially localizable because patch embeddings connect directly to input space.
- A sparsity loss forces explanations to focus on relevant object parts instead of background noise.
- The full interpretability level previously available only with QPM becomes usable as a plug-in adapter for any frozen visual foundation model.
- The method outperforms other applicable techniques for frozen backbones on both accuracy and explanation-quality metrics.
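The exact form of the sparsity loss is not given on this page. As an illustration only, the sketch below uses a plain L1-style penalty on per-patch activation maps, which is a common sparsity regulariser and not necessarily the authors' formulation. `l1_sparsity`, `total_loss`, and the toy maps are hypothetical. `sparsity_loss_weight` is the free parameter named in the ledger further down.

```python
def l1_sparsity(activation_map):
    """Mean absolute activation over all patches of one feature map (assumed penalty)."""
    values = [abs(v) for row in activation_map for v in row]
    return sum(values) / len(values)

def total_loss(classification_loss, activation_maps, sparsity_loss_weight):
    """Classification loss plus the weighted average penalty over all feature maps."""
    penalty = sum(l1_sparsity(m) for m in activation_maps) / len(activation_maps)
    return classification_loss + sparsity_loss_weight * penalty

# Two toy 3x3 maps (made-up numbers): one focused on a single patch, one with background leakage.
focused = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
scattered = [[0.5, 0.5, 0.5], [0.5, 3.0, 0.5], [0.5, 0.5, 0.5]]

print(l1_sparsity(focused) < l1_sparsity(scattered))  # True
```

Under this assumed penalty, the map with background activation is penalised more strongly, so minimising the loss favours explanations concentrated on few patches.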
Where Pith is reading between the lines
- The same adapter pattern could be tested on other frozen foundation models such as CLIP or larger ViT variants to check generalization.
- The sparsity-driven focus on object parts may reduce reliance on spurious correlations that plague many post-hoc explanation techniques.
- Standardized use of a Plausibility metric could encourage consistent benchmarking of explanation quality across future interpretable-vision papers.
- If the adapter preserves performance while adding interpretability, regulatory or safety-critical vision deployments could adopt it without retraining large backbones.
Load-bearing premise
Average-pooling patch embeddings plus a sparsity loss on a frozen backbone yields globally consistent human-plausible explanations without new fitting artifacts or loss of discriminative power.
What would settle it
If DINO-QPM classification accuracy falls below the DINOv2 linear probe on ImageNet or if its explanations fail to exceed competing methods on the introduced Plausibility metric, the central claim would be refuted.
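The accuracy half of this test can be written as a small check over repeated runs. This is a sketch under assumptions: the per-seed accuracies below are invented placeholders, not results from the paper, and comparing means is only one possible reading of "falls below".

```python
from statistics import mean, stdev

def summarize(runs):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    return mean(runs), stdev(runs)

def accuracy_claim_survives(adapter_runs, probe_runs):
    """True if the adapter's mean accuracy is not below the linear probe's mean accuracy."""
    return mean(adapter_runs) >= mean(probe_runs)

# Placeholder per-seed accuracies (hypothetical values for illustration only).
adapter_runs = [0.90, 0.91, 0.92]
probe_runs = [0.88, 0.89, 0.87]

print(accuracy_claim_survives(adapter_runs, probe_runs))  # True
```

A stricter variant would also require the gap between the means to exceed the run-to-run spread reported by `summarize`.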
Original abstract
Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DINO-QPM, a lightweight adapter for strictly frozen DINOv2 backbones that replaces the CLS token with average-pooling over patch embeddings and adds a sparsity loss. This is claimed to convert entangled foundation-model features into contrastive, globally interpretable representations while exceeding the accuracy of a DINOv2 linear probe and improving explanation quality as measured by a newly introduced Plausibility metric together with other interpretability scores.
Significance. If the accuracy claim and the validity of the Plausibility metric are substantiated, the work would be significant for extending globally interpretable methods such as QPM to modern frozen visual foundation models without backbone retraining. The adapter design and explicit spatial-localization goal address a practical gap between high-performance feature extractors and human-plausible explanations.
major comments (3)
- [Abstract] Abstract: the headline claim that DINO-QPM exceeds DINOv2 linear-probe accuracy is stated without any numerical values, standard deviations, or references to specific tables or figures; the same paragraph introduces the Plausibility metric but supplies no definition, formula, or validation procedure against human judgments.
- [§3] §3 (Method): the central accuracy claim rests on the untested assumption that uniform average-pooling of patch embeddings plus a sparsity loss will preserve (or exceed) the discriminative power already encoded in the learned CLS token of a frozen DINOv2 backbone; no ablation isolating the pooling operator versus a CLS-based adapter is reported.
- [§4] §4 (Experiments): the sparsity-loss weight and QPM regularization parameters are listed as free hyperparameters; without an explicit statement that they were selected on a held-out validation split separate from the reported test sets, the reported superiority on both accuracy and interpretability metrics risks being inflated by post-hoc tuning.
minor comments (2)
- [§3] Notation for the QPM adapter layers and the exact form of the sparsity loss could be clarified with an equation block or pseudocode.
- [§4] Figure captions should explicitly state the backbone, dataset split, and whether the backbone is frozen when presenting qualitative explanation maps.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
- Referee: [Abstract] Abstract: the headline claim that DINO-QPM exceeds DINOv2 linear-probe accuracy is stated without any numerical values, standard deviations, or references to specific tables or figures; the same paragraph introduces the Plausibility metric but supplies no definition, formula, or validation procedure against human judgments.
  Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript, we will include numerical accuracy improvements (with standard deviations) and direct references to the relevant tables and figures. We will also add a concise definition of the Plausibility metric together with a reference to its full formulation and evaluation procedure in the main text.
  Revision: yes
- Referee: [§3] §3 (Method): the central accuracy claim rests on the untested assumption that uniform average-pooling of patch embeddings plus a sparsity loss will preserve (or exceed) the discriminative power already encoded in the learned CLS token of a frozen DINOv2 backbone; no ablation isolating the pooling operator versus a CLS-based adapter is reported.
  Authors: The referee correctly identifies the lack of an explicit ablation isolating average-pooling from a CLS-token adapter. While our design is motivated by the requirement for spatial localization (which the CLS token cannot support), we acknowledge that a direct comparison would strengthen the evidence. We will add such an ablation study in the revised version.
  Revision: yes
- Referee: [§4] §4 (Experiments): the sparsity-loss weight and QPM regularization parameters are listed as free hyperparameters; without an explicit statement that they were selected on a held-out validation split separate from the reported test sets, the reported superiority on both accuracy and interpretability metrics risks being inflated by post-hoc tuning.
  Authors: The hyperparameters were selected via tuning on held-out validation splits that are disjoint from the reported test sets. We will make this procedure explicit in the revised Experiments section, including the specific validation protocol and selected values.
  Revision: yes
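The selection protocol described in the last response can be sketched as follows. This is a generic held-out validation pattern, not the authors' code: `three_way_split`, `select_weight`, the split fractions, and the toy scoring function are all assumptions made for illustration.

```python
import random

def three_way_split(n_items, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle item indices once and cut them into disjoint train/val/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_frac)
    n_val = int(n_items * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

def select_weight(candidates, val_score):
    """Pick the candidate with the best validation score; the test split is never consulted."""
    return max(candidates, key=val_score)

train, val, test = three_way_split(100)

# Hypothetical validation score that peaks at a sparsity weight of 0.1.
best = select_weight([0.01, 0.1, 1.0], val_score=lambda w: -abs(w - 0.1))
print(best)  # 0.1
```

Only after the weight is fixed on the validation split would the test split be evaluated, once, to produce the reported numbers.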
Circularity Check
No significant circularity; empirical claims rest on external baselines and standard hyperparameter practice
The paper proposes an adapter method (average-pooling + sparsity loss on frozen DINOv2) and evaluates it empirically against DINOv2 linear probe and other methods using introduced metrics. No equations, derivations, or uniqueness theorems are presented that reduce to fitted parameters or self-citations by construction. Hyperparameter choices (e.g., sparsity weight) follow standard validation-set tuning and do not force the reported accuracy or plausibility gains. The central claims remain falsifiable via the external comparisons shown in the experiments.
Axiom & Free-Parameter Ledger
free parameters (2)
- sparsity_loss_weight
- QPM_regularization_parameters
axioms (1)
- domain assumption: Average pooling of patch embeddings preserves sufficient spatial information for global interpretability.
Reference graph
Works this paper leans on
- [1] Rusty Blackbird Identification, All About Birds, Cornell Lab of Ornithology. https://www.allaboutbirds.org/guide/Rusty_Blackbird/id.
- [2] Samira Abnar and Willem Zuidema. Quantifying Attention Flow in Transformers, 2020.
- [3]
- [4] Hubert Baniecki and Przemyslaw Biecek. Birds look like cars: adversarial analysis of intrinsically interpretable deep learning. Machine Learning, 114(12), 2025.
- [5] Moritz Böhle, Mario Fritz, and Bernt Schiele. B-cos Networks: Alignment is All We Need for Interpretability. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10319–10328, New Orleans, LA, USA, 2022. IEEE.
- [6] Moritz Böhle, Navdeeppal Singh, Mario Fritz, and Bernt Schiele. B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers, 2024.
- [7] Lennart Brocki, Jakub Binda, and Neo Christopher Chung. Class-Discriminative Attention Maps for Vision Transformers, 2024.
- [8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. DinoV1: Emerging Properties in Self-Supervised Vision Transformers, 2021.
- [9] Hila Chefer, Shir Gur, and Lior Wolf. Transformer Interpretability Beyond Attention Visualization, 2021.
- [10] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This Looks Like That: Deep Learning for Interpretable Image Recognition, 2019.
- [11] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations, 2020.
- [12] Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context Autoencoder for Self-Supervised Representation Learning, 2023.
- [13] Minjae Chung, Jong Bum Won, Ganghyun Kim, Yujin Kim, and Utku Ozbulak. Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging. pages 110–120. 2025.
- [14] Ian Covert, Chanwoo Kim, and Su-In Lee. Learning to Estimate Shapley Values with Vision Transformers, 2023.
- [15] Beilei Cui, Mobarakol Islam, Long Bai, and Hongliang Ren. Surgical-dino: Adapter learning of foundation models for depth estimation in endoscopic surgery, 2024.
- [16] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers, 2024.
- [17] Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The Road Less Scheduled, 2024.
- [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021.
- [19] Ryan Farrell. CUB-200-2011 Segmentations, 2022. Segmentation masks for the CUB-200-2011 dataset.
- [20] Patrick Glandorf and Bodo Rosenhahn. Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3006–3016, 2025.
- [21] Patrick Glandorf, Timo Kaiser, and Bodo Rosenhahn. Hypersparse neural networks: Shifting exploration to exploitation through adaptive regularization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1234–1243, 2023.
- [22] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised Learning, 2020.
- [23] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2024.
- [24] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 1735–1742, 2006.
- [25] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning, 2020.
- [26] Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that... does it? shortcomings of latent space prototype interpretability in deep networks. CoRR, abs/2105.02968, 2021.
- [27] Henry F. Inman and Edwin L. Bradley Jr. The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities. Communications in Statistics - Theory and Methods, 18(10):3851–3874, 1989.
- [28] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition, 2023.
- [29] Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?, 2020.
- [30] Sarthak Jain and Byron C. Wallace. Attention is not Explanation, 2019.
- [31] Timo Kaiser, Thomas Norrenbrock, and Bodo Rosenhahn. Uncertainsam: Fast and efficient uncertainty quantification of the segment anything model. In Forty-second International Conference on Machine Learning.
- [32] Rojina Kashefi, Leili Barekatain, Mohammad Sabokrou, and Fatemeh Aghaeipoor. Explainability of Vision Transformers: A Comprehensive Review and New Perspectives, 2023.
- [33] Sunnie S. Y. Kim, Nicole Meister, Vikram V. Ramaswamy, Ruth Fong, and Olga Russakovsky. Hive: Evaluating the human interpretability of visual explanations, 2022.
- [34] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13), Sydney, Australia, 2013.
- [35] Peter Lipton. Contrastive explanation. Royal Institute of Philosophy Supplement, 27:247–266, 1990.
- [36] Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, and Serena Yeung-Levy. Data or Language Supervision: What Makes CLIP Better than DINO?, 2025.
- [37] Chiyu Ma, Jon Donnelly, Wenjun Liu, Soroush Vosoughi, Cynthia Rudin, and Chaofan Chen. Interpretable Image Classification with Adaptive Prototype-based Vision Transformers. 2024.
- [38] Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, and Pietro Perona. A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification, 2024.
- [39] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38,
- [40] Meike Nauta, Ron van Bree, and Christin Seifert. Neural Prototype Trees for Interpretable Fine-grained Image Recognition, 2021.
- [41] Thomas Norrenbrock, Marco Rudolph, and Bodo Rosenhahn. Take 5: Interpretable Image Classification with a Handful of Features, 2023.
- [42] Thomas Norrenbrock, Marco Rudolph, and Bodo Rosenhahn. Q-SENN: Quantized Self-Explaining Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21482–21491, 2024.
- [43] Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Neslihan Kose, Ramesh Manuvinakurike, and Bodo Rosenhahn. CHiQPM: Calibrated Hierarchical Interpretable Image Classification, 2025.
- [44] Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Ramesh Manuvinakurike, and Bodo Rosenhahn. QPM: Discrete Optimization for Globally Interpretable Image Classification,
- [46] Tuomas Oikarinen, Subhro Das, Lam M. Nguyen, and Tsui-Wei Weng. Label-Free Concept Bottleneck Models, 2023.
- [47] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... DINOv2: Learning Robust Visual Features without Supervision, 2024.
- [48] Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers. In Advances in Neural Information Processing Systems, pages 24898–24911. Curran Associates, Inc.,
- [49] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library ..., 2019.
- [50] Katharina Prasse, Patrick Knab, Sascha Marton, Christian Bartelt, and Margret Keuper. DCBM: Data-Efficient Visual Concept Bottleneck Models, 2025.
- [51] Stephen J. Read and Amy Marcus-Newhall. Explanatory coherence in social explanations: A parallel distributed processing account. Journal of Personality and Social Psychology, 65(3):429–447, 1993.
- [52] Eleanor Rosch. Principles of categorization. In Cognition and categorization, pages 27–48. Lawrence Erlbaum Associates, Hillsdale, NJ, 1978.
- [53] Bodo Rosenhahn. Optimization of sparsity-constrained neural networks as a mixed integer linear program. Journal of Optimization Theory and Applications, 199(3):931–954,
- [54] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215,
- [55] Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, and Bartosz Zieliński. ProtoPShare: Prototype Sharing for Interpretable Image Classification and Similarity Discovery. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1420–1430, 2021.
- [56] Carl Savignac. COSEWIC Assessment and Status Report on the Rusty Blackbird, Euphagus Carolinus, in Canada. Committee on the Status of Endangered Wildlife in Canada, Ottawa, 2006.
- [57] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
- [58] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2):336–359, 2020.
- [59] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, ..., 2025.
- [60] Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept Bottleneck Large Language Models,
- [61] Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, and Christian Lovis. ProtoS-ViT: Visual foundation models for sparse self-explainable classifications, 2024.
- [62] Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, and Christian Lovis. Tell me why: Visual foundation models as self-explainable classifiers, 2025.
- [63] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding, 2019.
- [64] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 595–604,
- [65] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-UCSD birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology,
- [66] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification, 2023.
- [67] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc Concept Bottleneck Models, 2023.
- [68] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 543–559. Springer, 2016.
- [69] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence, 2023.
- [70] Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, and Jun Yu. Partially Shared Concept Bottleneck Models, 2025.
- [71] Zhijie Zhu, Lei Fan, Maurice Pagnucco, and Yang Song. Interpretable Image Classification via Non-parametric Part Prototype Learning, 2025.
DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification, Supplementary Material
- [72] Feature Diversity Loss. To reduce conceptual ambiguity between features, Norrenbrock et al. [41] introduced the Feature Diversity Loss, hereafter referred to as L_div. The objective of L_div is to encourage the representation of distinct, mutually independent concepts within the features, thereby enhancing the degree of model interpretability. Let i ∈ I = {...
- [73] Definition of Additional Interpretability Metrics. To assess model interpretability, we apply several metrics following Norrenbrock et al. [41, 42, 44]. Since interpretability is multifaceted, multiple metrics addressing distinct concepts are necessary. Throughout this section, we utilise the following notation for index sets: i ∈ I = {1, ..., W_f} an...
- [74] Implementation Details. All input images are resized to 224×224 pixels and normalised according to the dataset mean values. Unless otherwise specified, the Multi-Layer Perceptron (MLP) consists of four layers featuring ReLU activation and batch normalisation. The number of features is set to N_f = 512, and the number of neurons in the hidden layers is N_hidde...
- [75] Impact of Auxiliary Losses. The L_div loss, as proposed by Norrenbrock et al. [41] and introduced in detail in Sec. 7, is analysed here. Fig. 11 illustrates the influence of L_div on accuracy and SID@5. Notably, increasing the weight of this loss has a strong positive correlation with SID@5. Hence, the lightweight interpretability adapter can be steered si...
- [76] Impact of MLP Depth. Fig. 13 illustrates the accuracy plotted against the number of neurons in the MLP's hidden layers N_hidden. Small accuracy gains are observed up to N_hidden = 2048, regardless of the number of features N_f which is why we chose N_hidden = 2048 and N_f = 512, obtaining optimal accuracy while minimising compactness. [Figure: plot with an SID@5 axis ...]
- [77] Visualisations. 12.1. Class Comparisons. [Figure 14 panels: Hooded Oriole features and Hooded Warbler features; labels 44, 0, 34, 36, 41, 1.] Figure 14. Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for each class used to represent the Hooded Oriole and Hooded Warbler, completely without external ...
- [78] Detailed Results.
  Method Local. Features Accuracy↑ Faithful.↑ SID@5↑ Class-Indep.↑ Contrast.↑ CUB CARS CUB CARS CUB CARS CUB CARS
  DINOv2ffrozCLSLinear Probe ✗ 87.9±0.1 91.7±0.1 42.6±0.2 50.9±0.2 51.5±0.1 99.2±0.0 99.1±0.0 59.2±0.0 60.9±0.0
  DenseFfroz ✓ 78.1±0.3 92.9±0.1 32.7±0.2 91.8±0.7 93.1±0.1 98.8±0.0 98.7±0.0 84.5±0.3 82.8±0.1
  Resnet50 Baseline [44] ✓ 83.9±0....