DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

Bodo Rosenhahn; Robert Zimmermann; Thomas Norrenbrock

arXiv: 2604.07166 · v1 · submitted 2026-04-08 · cs.CV · cs.HC · cs.LG

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

keywords: DINOv2, interpretability, image classification, adapter, sparsity loss, frozen backbone, quadratic programming, visual foundation models

The pith

DINO-QPM turns frozen DINOv2 patch embeddings into human-interpretable class-independent features while beating linear-probe accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DINO-QPM attaches a lightweight adapter based on the Quadratic Programming Enhanced Model to a frozen DINO backbone. It replaces the usual CLS token with average-pooled patch embeddings and adds a sparsity loss so that the resulting representations remain contrastive and spatially localizable. The adapter produces globally consistent explanations grounded in object parts rather than background. Experiments on standard benchmarks show higher classification accuracy than a DINOv2 linear probe together with better scores on a new Plausibility metric and other interpretability measures. The approach therefore brings QPM-style interpretability to visual foundation models without any backbone retraining.

Core claim

DINO-QPM adapts the Quadratic Programming Enhanced Model as a lightweight interpretability adapter for strictly frozen DINO backbones. By using average-pooling of patch embeddings instead of the CLS token and imposing a sparsity loss that minimizes spatial scatter, it converts entangled high-dimensional features into contrastive, class-independent representations that support human-plausible global explanations and direct spatial localization in the input image.

What carries the argument

QPM adapter applied to average-pooled patch embeddings from a frozen DINO backbone, regularized by a sparsity loss that reduces background noise.
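To make the pooling step concrete, here is a minimal sketch in plain Python. It assumes the adapter yields one feature map per feature, with one activation per patch on a flattened grid; the function names, the toy values, and the 2 x 3 grid are hypothetical and not taken from the paper. The point it illustrates is that an average-pooled feature can be traced back to a patch position, which a single CLS vector does not allow.

```python
# Hypothetical sketch: names, shapes, and values are illustrative only.

def average_pool(feature_maps):
    """Average each feature map over all patch positions.

    feature_maps: list of per-feature lists, each holding one activation
    per patch (a flattened H x W patch grid).
    Returns one scalar per feature.
    """
    return [sum(fmap) / len(fmap) for fmap in feature_maps]


def peak_patch(feature_map, grid_width):
    """Return (row, col) of the patch with the strongest activation."""
    idx = max(range(len(feature_map)), key=feature_map.__getitem__)
    return divmod(idx, grid_width)


# Two toy feature maps on a 2 x 3 patch grid (6 patches).
feature_maps = [
    [0.0, 0.0, 0.9, 0.0, 0.0, 0.3],  # feature 0 peaks in the top-right patch
    [0.0, 0.6, 0.0, 0.0, 0.0, 0.0],  # feature 1 peaks in the top-middle patch
]

features = average_pool(feature_maps)  # pooled feature vector
locations = [peak_patch(fmap, grid_width=3) for fmap in feature_maps]
```

Each pooled value is a plain mean of per-patch activations, so the same feature maps that produce the feature vector can also be drawn as heatmaps over the input.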

If this is right

Classification accuracy exceeds that of a standard DINOv2 linear probe.
Explanations become globally consistent and spatially localizable because patch embeddings connect directly to input space.
A sparsity loss forces explanations to focus on relevant object parts instead of background noise.
The full interpretability level previously available only with QPM becomes usable as a plug-in adapter for any frozen visual foundation model.
The method outperforms other applicable techniques for frozen backbones on both accuracy and explanation-quality metrics.
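The sparsity loss mentioned in the list above can be sketched as an L1 penalty on feature-map activations. This is an assumption: the page does not give the loss in closed form, so the mean-absolute-value form, the function names, and the `sparsity_weight` parameter below are all hypothetical.

```python
# Hypothetical sketch of an L1-style sparsity penalty on feature maps.
# The exact loss used in the paper is not stated here; this is an assumed form.

def l1_feature_map_penalty(feature_maps):
    """Mean absolute activation over all features and patch positions."""
    total = sum(abs(a) for fmap in feature_maps for a in fmap)
    count = sum(len(fmap) for fmap in feature_maps)
    return total / count


def total_loss(classification_loss, feature_maps, sparsity_weight):
    """Classification loss plus the weighted sparsity penalty."""
    return classification_loss + sparsity_weight * l1_feature_map_penalty(feature_maps)


focused = [[0.0, 0.0, 0.8, 0.0]]  # one strong activation, silent background
noisy = [[0.1, 0.1, 0.8, 0.1]]    # same peak plus low-level background activity

penalty_focused = l1_feature_map_penalty(focused)
penalty_noisy = l1_feature_map_penalty(noisy)
```

Under this assumed form the noisy map is penalised more than the focused one, which is the direction the sparsity term is described as pushing: less activation on background patches.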

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same adapter pattern could be tested on other frozen foundation models such as CLIP or larger ViT variants to check generalization.
The sparsity-driven focus on object parts may reduce reliance on spurious correlations that plague many post-hoc explanation techniques.
Standardized use of a Plausibility metric could encourage consistent benchmarking of explanation quality across future interpretable-vision papers.
If the adapter preserves performance while adding interpretability, regulatory or safety-critical vision deployments could adopt it without retraining large backbones.

Load-bearing premise

Average-pooling patch embeddings plus a sparsity loss on a frozen backbone yields globally consistent human-plausible explanations without new fitting artifacts or loss of discriminative power.

What would settle it

If DINO-QPM classification accuracy falls below the DINOv2 linear probe on ImageNet or if its explanations fail to exceed competing methods on the introduced Plausibility metric, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2604.07166 by Bodo Rosenhahn, Robert Zimmermann, Thomas Norrenbrock.

**Figure 1.** Overview of our proposed DINO-QPM. The pipeline processes the (a) input image using the frozen backbone to produce patch…

**Figure 2.** Radar plot demonstrating the quality of DINO-QPM…

**Figure 4.** Architecture of our proposed DINO-QPM. Patch embed…

**Figure 5.** Comparison of a Brewer’s Blackbird image with a Rusty Blackbird image. From the selected features…

**Figure 6.** Visualisation of the Plausibility metric on a sample from…

**Figure 7.** Impact of the number of MLP layers ($N_{layers}$) on classification accuracy. The plot compares the accuracy on CUB-2011 of using frozen patch-level feature maps ($F^{froz}$) versus the global feature vector ($f^{froz}_{CLS}$) for both Dense and QPM…
[Plot residue omitted: a grid of accuracy values over the axes "N Selected Features" ($N^*_f$) and "N Features per Class" ($N^c_f$).]

**Figure 10.** Qualitative ablation of $L_{\text{L1-FM}}$ on the Gray-crowned Rosy Finch. Without $L_{\text{L1-FM}}$ (top row), feature activations exhibit background noise and spatial scatter. Adding $L_{\text{L1-FM}}$ (bottom row) suppresses this noise, resulting in distinct activations semantically localised to specific object parts.

**Figure 11.** Accuracy and Feature Diversity (SID@5) on CUB-2011 across variations of the…

**Figure 12.** Impact of the $L_{\text{L1-FV}}$ on CUB-2011 accuracy during finetuning.
[Plot residue omitted: axes $N_{hidden}$ and Accuracy.]

**Figure 13.** Mean finetuning accuracy on CUB-2011 for various numbers of features…

**Figure 14.** Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for each…

**Figure 15.** Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for…

**Figure 16.** Faithful global interpretability on Stanford Cars: DINO-QPM autonomously discovers the 5 diverse, generalisable features…

**Figure 17.** Faithful global interpretability on Stanford Cars: DINO-QPM autonomously discovers the 5 diverse, generalisable features…

**Figure 18.** Comparison on the Orchard Oriole (CUB-2011). We show test samples where our DINO-QPM correctly classifies the image while the dense baseline fails. Columns from left to right: original image ($X$), GradCAM activation map of the dense model, the most similar training sample from the dense-predicted class alongside its cosine similarity score ($\mathrm{sim} = \max_{s \in S_{c_{pred}}} \mathrm{CosSim}(f^{froz}_{CLS}(X), f^{froz}_{CLS}(s))$), and the …

**Figure 19.** Comparison on the Louisiana Waterthrush (CUB-2011). We show test samples where our DINO-QPM correctly classifies the image while the dense baseline fails. Columns from left to right: original image ($X$), GradCAM activation map of the dense model, the most similar training sample from the dense-predicted class alongside its cosine similarity score ($\mathrm{sim} = \max_{s \in S_{c_{pred}}} \mathrm{CosSim}(f^{froz}_{CLS}(X), f^{froz}_{CLS}(s))$), a…

**Figure 20.** Comparison on the Summer Tanager (CUB-2011). We compare the dense baseline and DINO-QPM on eight test images, correctly classified by both models. Each sample is shown as a triplet: the original image ($X$), the GradCAM attribution of the dense model, and the local explanation of DINO-QPM for the true class $c_{true}$. The GradCAM attributions of the dense model frequently spread across the background or miss th…

**Figure 21.** Indigo Bunting samples from the CUB-2011 test set. We compare the dense baseline and DINO-QPM on eight test images,…

**Figure 22.** Failure analysis on the Green Kingfisher (CUB-2011). Comparison of test samples misclassified by both models. Columns (left to right): original image ($X$); dense model GradCAM; DINO-QPM local explanations for both the true and predicted classes; and the nearest training sample from the predicted class with its cosine similarity ($\mathrm{sim} = \max_{s \in S_{c_{pred}}} \mathrm{CosSim}(f^{froz}_{CLS}(X), f^{froz}_{CLS}(s))$). Although both models …

**Figure 23.** Failure analysis on the Least Flycatcher (CUB-2011). We show test samples where the dense baseline classifies correctly but DINO-QPM does not. Columns from left to right: original image ($X$), GradCAM attribution of the dense model, DINO-QPM local explanations for the true and predicted classes, and the most similar training sample from the predicted class with its cosine similarity score ($\mathrm{sim} = \max_{s \in S_{c_{pred}}}$…
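
Several of the captions above report a similarity score: the maximum cosine similarity between the frozen CLS embedding of a test image and the CLS embeddings of training samples from the predicted class. A minimal sketch of that lookup follows, using plain Python lists as stand-ins for embeddings; the function names and toy vectors are hypothetical, and the embeddings are assumed to be non-zero.

```python
import math

# Hypothetical sketch of the nearest-training-sample lookup described in the
# captions. Vectors are toy stand-ins for frozen CLS embeddings.

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_training_sample(query, class_embeddings):
    """Return (index, similarity) of the most similar training embedding."""
    sims = [cosine_similarity(query, s) for s in class_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


query = [1.0, 0.0]                            # embedding of the test image
train = [[0.0, 2.0], [3.0, 3.0], [2.0, 0.1]]  # embeddings of one class's training samples
best_index, sim = nearest_training_sample(query, train)
```

In this toy case the third training vector points almost in the same direction as the query, so it is returned as the nearest sample.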

Original abstract

Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes DINO-QPM, a lightweight adapter for strictly frozen DINOv2 backbones that replaces the CLS token with average-pooling over patch embeddings and adds a sparsity loss. This is claimed to convert entangled foundation-model features into contrastive, globally interpretable representations while exceeding the accuracy of a DINOv2 linear probe and improving explanation quality as measured by a newly introduced Plausibility metric together with other interpretability scores.

Significance. If the accuracy claim and the validity of the Plausibility metric are substantiated, the work would be significant for extending globally interpretable methods such as QPM to modern frozen visual foundation models without backbone retraining. The adapter design and explicit spatial-localization goal address a practical gap between high-performance feature extractors and human-plausible explanations.

major comments (3)

[Abstract] Abstract: the headline claim that DINO-QPM exceeds DINOv2 linear-probe accuracy is stated without any numerical values, standard deviations, or references to specific tables or figures; the same paragraph introduces the Plausibility metric but supplies no definition, formula, or validation procedure against human judgments.
[§3] §3 (Method): the central accuracy claim rests on the untested assumption that uniform average-pooling of patch embeddings plus a sparsity loss will preserve (or exceed) the discriminative power already encoded in the learned CLS token of a frozen DINOv2 backbone; no ablation isolating the pooling operator versus a CLS-based adapter is reported.
[§4] §4 (Experiments): the sparsity-loss weight and QPM regularization parameters are listed as free hyperparameters; without an explicit statement that they were selected on a held-out validation split separate from the reported test sets, the reported superiority on both accuracy and interpretability metrics risks being inflated by post-hoc tuning.

minor comments (2)

[§3] Notation for the QPM adapter layers and the exact form of the sparsity loss could be clarified with an equation block or pseudocode.
[§4] Figure captions should explicitly state the backbone, dataset split, and whether the backbone is frozen when presenting qualitative explanation maps.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses

Referee: [Abstract] Abstract: the headline claim that DINO-QPM exceeds DINOv2 linear-probe accuracy is stated without any numerical values, standard deviations, or references to specific tables or figures; the same paragraph introduces the Plausibility metric but supplies no definition, formula, or validation procedure against human judgments.

Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript, we will include numerical accuracy improvements (with standard deviations) and direct references to the relevant tables and figures. We will also add a concise definition of the Plausibility metric together with a reference to its full formulation and evaluation procedure in the main text.

Revision: yes

Referee: [§3] §3 (Method): the central accuracy claim rests on the untested assumption that uniform average-pooling of patch embeddings plus a sparsity loss will preserve (or exceed) the discriminative power already encoded in the learned CLS token of a frozen DINOv2 backbone; no ablation isolating the pooling operator versus a CLS-based adapter is reported.

Authors: The referee correctly identifies the lack of an explicit ablation isolating average-pooling from a CLS-token adapter. While our design is motivated by the requirement for spatial localization (which the CLS token cannot support), we acknowledge that a direct comparison would strengthen the evidence. We will add such an ablation study in the revised version.

Revision: yes

Referee: [§4] §4 (Experiments): the sparsity-loss weight and QPM regularization parameters are listed as free hyperparameters; without an explicit statement that they were selected on a held-out validation split separate from the reported test sets, the reported superiority on both accuracy and interpretability metrics risks being inflated by post-hoc tuning.

Authors: The hyperparameters were selected via tuning on held-out validation splits that are disjoint from the reported test sets. We will make this procedure explicit in the revised Experiments section, including the specific validation protocol and selected values.

Revision: yes
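
The selection protocol at issue in this last exchange can be sketched generically: choose the hyperparameter on a validation split and touch the test split only once, after the choice is fixed. Everything below is a hypothetical stand-in (the toy `toy_fit` and `toy_evaluate` functions, the candidate values, the split dictionaries); none of it reflects the paper's actual training code or selected values.

```python
# Hypothetical sketch of validation-based hyperparameter selection.

def select_on_validation(candidates, fit, evaluate, train, val):
    """Return (best_value, best_score), scored on the validation split only."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        model = fit(train, value)
        score = evaluate(model, val)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


# Toy stand-ins so the sketch runs: the "model" is just the weight itself,
# and the score peaks at a split-specific optimum.
def toy_fit(train_split, weight):
    return weight


def toy_evaluate(model, split):
    return -(model - split["optimum"]) ** 2


train = {"optimum": 0.1}
val = {"optimum": 0.1}
test = {"optimum": 0.12}

best_weight, val_score = select_on_validation(
    [0.01, 0.1, 1.0], toy_fit, toy_evaluate, train, val
)
# The test split is evaluated once, after the weight has been fixed.
test_score = toy_evaluate(toy_fit(train, best_weight), test)
```

Keeping the test split out of `select_on_validation` entirely is what guards against the post-hoc tuning the referee describes.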

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external baselines and standard hyperparameter practice

full rationale

The paper proposes an adapter method (average-pooling + sparsity loss on frozen DINOv2) and evaluates it empirically against DINOv2 linear probe and other methods using introduced metrics. No equations, derivations, or uniqueness theorems are presented that reduce to fitted parameters or self-citations by construction. Hyperparameter choices (e.g., sparsity weight) follow standard validation-set tuning and do not force the reported accuracy or plausibility gains. The central claims remain falsifiable via the external comparisons shown in the experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The adapter relies on the assumption that DINO patch embeddings contain spatially meaningful information that can be linearly combined after average pooling. No new entities are postulated. Hyperparameters of the sparsity loss and QPM solver are free parameters whose values are not reported in the abstract.

free parameters (2)

sparsity_loss_weight
Controls the trade-off between classification accuracy and spatial focus; value not stated in abstract.
QPM_regularization_parameters
Inherited from the original QPM formulation; tuning details absent.

axioms (1)

Domain assumption: Average pooling of patch embeddings preserves sufficient spatial information for global interpretability.
Invoked when the paper states that average-pooling enables spatial localisation within the input space.
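
One way to see why this assumption supports localisation is to write out the pooling step. The derivation below is a sketch under stated assumptions: a linear final layer with weights $W_{c,j}$ and no bias, feature maps $F_{j,p}$ over $P$ patches, and pooled features $f_j$. These symbols are introduced here for illustration and are not the paper's notation.

```latex
f_j = \frac{1}{P}\sum_{p=1}^{P} F_{j,p}
\quad\Longrightarrow\quad
y_c = \sum_{j} W_{c,j}\, f_j
    = \frac{1}{P}\sum_{p=1}^{P} \sum_{j} W_{c,j}\, F_{j,p}
```

Because the mean is linear, the logit $y_c$ splits into one additive term per patch, so each patch's contribution to a class score can be read off directly. Whether the resulting maps are plausible to humans is the empirical part of the assumption.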

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages

[1]

https://www.allaboutbirds.org/guide/Rusty Blackbird/id

Rusty Blackbird Identification, All About Birds, Cornell Lab of Ornithology. https://www.allaboutbirds.org/guide/Rusty Blackbird/id. 5

work page
[2]

Quantifying Attention Flow in Transformers, 2020

Samira Abnar and Willem Zuidema. Quantifying Attention Flow in Transformers, 2020. 3

work page 2020
[3]

Jaakkola

David Alvarez-Melis and Tommi S. Jaakkola. Towards ro- bust interpretability with self-explaining neural networks. CoRR, abs/1806.07538, 2018. 2, 6

work page arXiv 2018
[4]

Birds look like cars: adversarial analysis of intrinsically interpretable deep learning.Machine Learning, 114(12), 2025

Hubert Baniecki and Przemyslaw Biecek. Birds look like cars: adversarial analysis of intrinsically interpretable deep learning.Machine Learning, 114(12), 2025. 2

work page 2025
[5]

B-cos Net- works: Alignment is All We Need for Interpretability

Moritz Bohle, Mario Fritz, and Bernt Schiele. B-cos Net- works: Alignment is All We Need for Interpretability. In 2022 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 10319–10328, New Or- leans, LA, USA, 2022. IEEE. 3

work page 2022
[6]

B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers, 2024

Moritz B ¨ohle, Navdeeppal Singh, Mario Fritz, and Bernt Schiele. B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers, 2024. 3

work page 2024
[7]

Class-Discriminative Attention Maps for Vision Transform- ers, 2024

Lennart Brocki, Jakub Binda, and Neo Christopher Chung. Class-Discriminative Attention Maps for Vision Transform- ers, 2024. 2, 3

work page 2024
[8]

Di- noV1: Emerging Properties in Self-Supervised Vision Trans- formers, 2021

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Di- noV1: Emerging Properties in Self-Supervised Vision Trans- formers, 2021. 2, 4, 5

work page 2021
[9]

Transformer Inter- pretability Beyond Attention Visualization, 2021

Hila Chefer, Shir Gur, and Lior Wolf. Transformer Inter- pretability Beyond Attention Visualization, 2021. 2, 3

work page 2021
[10]

This Looks Like That: Deep Learning for Interpretable Image Recognition, 2019

Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This Looks Like That: Deep Learning for Interpretable Image Recognition, 2019. 1

work page 2019
[11]

A Simple Framework for Contrastive Learn- ing of Visual Representations, 2020

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A Simple Framework for Contrastive Learn- ing of Visual Representations, 2020. 4, 7, 12

work page 2020
[12]

Context Autoencoder for Self- Supervised Representation Learning, 2023

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context Autoencoder for Self- Supervised Representation Learning, 2023. 2

work page 2023
[13]

Evaluating Visual Explanations of Atten- tion Maps for Transformer-based Medical Imaging

Minjae Chung, Jong Bum Won, Ganghyun Kim, Yujin Kim, and Utku Ozbulak. Evaluating Visual Explanations of Atten- tion Maps for Transformer-based Medical Imaging. pages 110–120. 2025. 2

work page 2025
[14]

Learning to Esti- mate Shapley Values with Vision Transformers, 2023

Ian Covert, Chanwoo Kim, and Su-In Lee. Learning to Esti- mate Shapley Values with Vision Transformers, 2023. 3

work page 2023
[15]

Surgical-dino: Adapter learning of foundation models for depth estimation in endoscopic surgery, 2024

Beilei Cui, Mobarakol Islam, Long Bai, and Hongliang Ren. Surgical-dino: Adapter learning of foundation models for depth estimation in endoscopic surgery, 2024. 1, 4

work page 2024
[16]

Vision Transformers Need Registers, 2024

Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers, 2024. 4, 5

work page 2024
[17]

The Road Less Scheduled, 2024

Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Kon- stantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The Road Less Scheduled, 2024. 2

work page 2024
[18]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021. 2

work page 2021
[19]

CUB-200-2011 Segmentations, 2022

Ryan Farrell. CUB-200-2011 Segmentations, 2022. Seg- mentation masks for the CUB-200-2011 dataset. 6

work page 2011
[20]

Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation

Patrick Glandorf and Bodo Rosenhahn. Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3006–3016, 2025. 3

work page 2025
[21]

Hy- persparse neural networks: Shifting exploration to exploita- tion through adaptive regularization

Patrick Glandorf, Timo Kaiser, and Bodo Rosenhahn. Hy- persparse neural networks: Shifting exploration to exploita- tion through adaptive regularization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1234–1243, 2023. 3

work page 2023
[22]

Richemond, Elena Buchatskaya, Carl Do- ersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Moham- mad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, R´emi Munos, and Michal Valko

Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Do- ersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Moham- mad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, R´emi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised Learning, 2020. 4

work page 2020
[23]

Gurobi optimizer reference manual, 2024

Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2024. 2

work page 2024
[24]

Hadsell, S

R. Hadsell, S. Chopra, and Y . LeCun. Dimensionality Re- duction by Learning an Invariant Mapping. In2006 IEEE Computer Society Conference on Computer Vision and Pat- tern Recognition (CVPR’06), pages 1735–1742, 2006. 4

work page 2006
[25]

Momentum Contrast for Unsupervised Visual Rep- resentation Learning, 2020

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Rep- resentation Learning, 2020. 4

work page 2020
[26]

does it? shortcomings of latent space prototype interpretability in deep networks

Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that... does it? shortcomings of la- tent space prototype interpretability in deep networks.CoRR, abs/2105.02968, 2021. 2

work page arXiv 2021
[27]

Inman and Edwin L

Henry F. Inman and Edwin L. Bradley Jr. The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two nor- mal densities.Communications in Statistics - Theory and Methods, 18(10):3851–3874, 1989. 6, 2

work page 1989
[28]

Optimal transport aggre- gation for visual place recognition, 2023

Sergio Izquierdo and Javier Civera. Optimal transport aggre- gation for visual place recognition, 2023. 1, 4

work page 2023
[29]

Towards Faithfully Inter- pretable NLP Systems: How should we define and evaluate faithfulness?, 2020

Alon Jacovi and Yoav Goldberg. Towards Faithfully Inter- pretable NLP Systems: How should we define and evaluate faithfulness?, 2020. 6

work page 2020
[30]

Sarthak Jain and Byron C. Wallace. Attention is not Expla- nation, 2019. 2

work page 2019
[31]

Uncertainsam: Fast and efficient uncertainty quantification of the segment anything model

Timo Kaiser, Thomas Norrenbrock, and Bodo Rosenhahn. Uncertainsam: Fast and efficient uncertainty quantification of the segment anything model. InForty-second Interna- tional Conference on Machine Learning. 1

work page
[32]

Explainability of Vision Transformers: A Comprehensive Review and New Perspectives, 2023

Rojina Kashefi, Leili Barekatain, Mohammad Sabokrou, and Fatemeh Aghaeipoor. Explainability of Vision Transformers: A Comprehensive Review and New Perspectives, 2023. 3

work page 2023
[33]

Sunnie S. Y . Kim, Nicole Meister, Vikram V . Ramaswamy, Ruth Fong, and Olga Russakovsky. Hive: Evaluating the human interpretability of visual explanations, 2022. 2

work page 2022
[34]

3D object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13), Sydney, Australia, 2013. 5

work page 2013
[35]

Contrastive explanation.Royal Institute of Phi- losophy Supplement, 27:247–266, 1990

Peter Lipton. Contrastive explanation.Royal Institute of Phi- losophy Supplement, 27:247–266, 1990. 2, 6

work page 1990
[36]

Data or Language Supervision: What Makes CLIP Better than DINO?, 2025

Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, and Serena Yeung-Levy. Data or Language Supervision: What Makes CLIP Better than DINO?, 2025. 8

work page 2025
[37]

Interpretable Image Classification with Adaptive Prototype-based Vision Trans- formers

Chiyu Ma, Jon Donnelly, Wenjun Liu, Soroush V osoughi, Cynthia Rudin, and Chaofan Chen. Interpretable Image Classification with Adaptive Prototype-based Vision Trans- formers. 2024. 1, 3

work page 2024
[38]

A Closer Look at Benchmarking Self-Supervised Pre- training with Image Classification, 2024

Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, and Pietro Per- ona. A Closer Look at Benchmarking Self-Supervised Pre- training with Image Classification, 2024. 8

work page 2024
[39]

Explanation in artificial intelligence: Insights from the social sciences.Artificial Intelligence, 267:1–38,

Tim Miller. Explanation in artificial intelligence: Insights from the social sciences.Artificial Intelligence, 267:1–38,

work page
[40]

Neural Prototype Trees for Interpretable Fine-grained Image Recog- nition, 2021

Meike Nauta, Ron van Bree, and Christin Seifert. Neural Prototype Trees for Interpretable Fine-grained Image Recog- nition, 2021. 1

work page 2021
[41]

Take 5: Interpretable Image Classification with a Handful of Features, 2023

Thomas Norrenbrock, Marco Rudolph, and Bodo Rosen- hahn. Take 5: Interpretable Image Classification with a Handful of Features, 2023. 2, 3, 5, 7, 1, 12

work page 2023
[42]

Q-SENN: Quantized Self-Explaining Neural Net- works.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21482–21491, 2024

Thomas Norrenbrock, Marco Rudolph, and Bodo Rosen- hahn. Q-SENN: Quantized Self-Explaining Neural Net- works.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21482–21491, 2024. 7, 1, 12

work page 2024
[43]

CHiQPM: Calibrated Hierarchical Interpretable Image Clas- sification, 2025

Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Nesli- han Kose, Ramesh Manuvinakurike, and Bodo Rosenhahn. CHiQPM: Calibrated Hierarchical Interpretable Image Clas- sification, 2025. 2

work page 2025
[44]

QPM: Discrete Op- timization for Globally Interpretable Image Classification,

Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Ramesh Manuvinakurike, and Bodo Rosenhahn. QPM: Discrete Op- timization for Globally Interpretable Image Classification,

work page
[45]

2, 3, 4, 5, 6, 7, 8, 1, 12

work page
[46]

Nguyen, and Tsui- Wei Weng

Tuomas Oikarinen, Subhro Das, Lam M. Nguyen, and Tsui- Wei Weng. Label-Free Concept Bottleneck Models, 2023. 2, 3

work page 2023
[47]

DINOv2: Learning Robust Visual Features without Supervision, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...

work page 2024
[48]

IA-REDˆ2: Interpretability-Aware Redundancy Reduction for Vision Transformers

Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-REDˆ2: Interpretability-Aware Redundancy Reduction for Vision Transformers. InAdvances in Neural Information Process- ing Systems, pages 24898–24911. Curran Associates, Inc.,

work page
[49]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Rai- son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-per...

work page 2019
[50]

DCBM: Data-Efficient Visual Concept Bottleneck Models, 2025

Katharina Prasse, Patrick Knab, Sascha Marton, Christian Bartelt, and Margret Keuper. DCBM: Data-Efficient Visual Concept Bottleneck Models, 2025. 3

work page 2025
[51]

Read and Amy Marcus-Newhall

Stephen J. Read and Amy Marcus-Newhall. Explanatory co- herence in social explanations: A parallel distributed pro- cessing account.Journal of Personality and Social Psychol- ogy, 65(3):429–447, 1993. 2, 6

work page 1993
[52]

Principles of categorization

Eleanor Rosch. Principles of categorization. InCognition and categorization, pages 27–48. Lawrence Erlbaum Asso- ciates, Hillsdale, NJ, 1978. 1

[53] Bodo Rosenhahn. Optimization of sparsity-constrained neural networks as a mixed integer linear program. Journal of Optimization Theory and Applications, 199(3):931–954,

[54] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215,

[55] Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, and Bartosz Zieliński. ProtoPShare: Prototype Sharing for Interpretable Image Classification and Similarity Discovery. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1420–1430, 2021. 1

[56] Carl Savignac. COSEWIC Assessment and Status Report on the Rusty Blackbird, Euphagus Carolinus, in Canada. Committee on the Status of Endangered Wildlife in Canada, Ottawa, 2006. 5

[57] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015. 4

[58] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2):336–359, 2020. 6

[59] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

[60] Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept Bottleneck Large Language Models,

[61] Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, and Christian Lovis. ProtoS-ViT: Visual foundation models for sparse self-explainable classifications, 2024. 1

[62] Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, and Christian Lovis. Tell me why: Visual foundation models as self-explainable classifiers, 2025. 1

[63] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding, 2019. 4

[64] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 595–604,

[65] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology,

[66] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification, 2023. 2

[67] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc Concept Bottleneck Models, 2023. 2, 3

[68] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 543–559. Springer, 2016. 2, 6

[69] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence, 2023. 1, 4

[70] Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, and Jun Yu. Partially Shared Concept Bottleneck Models, 2025. 3

[71] Zhijie Zhu, Lei Fan, Maurice Pagnucco, and Yang Song. Interpretable Image Classification via Non-parametric Part Prototype Learning, 2025. 1, 3

DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification
Supplementary Material

Feature Diversity Loss

To reduce conceptual ambiguity between features, Norrenbrock et al. [41] introduced the Feature Diversity Loss, hereafter referred to as L_div. The objective of L_div is to encourage the representation of distinct, mutually independent concepts within the features, thereby enhancing the degree of model interpretability. Let i ∈ I = {...

Definition of Additional Interpretability Metrics

To assess model interpretability, we apply several metrics following Norrenbrock et al. [41, 42, 44]. Since interpretability is multifaceted, multiple metrics addressing distinct concepts are necessary. Throughout this section, we utilise the following notation for index sets: i ∈ I = {1, ..., W_f} an...

Implementation Details

All input images are resized to 224×224 pixels and normalised according to the dataset mean values. Unless otherwise specified, the Multi-Layer Perceptron (MLP) consists of four layers featuring ReLU activation and batch normalisation. The number of features is set to N_f = 512, and the number of neurons in the hidden layers is N_hidde...
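
As a rough illustration of the layer sizes described above, the following sketch counts the learnable parameters of a four-layer MLP with batch normalisation. Several details are assumptions made only for illustration and are not confirmed by the source: the input width of 768 (a ViT-B-sized patch embedding), the reading of "four layers" as four linear layers, the placement of batch normalisation after every hidden linear layer only, and the helper names `mlp_layer_shapes` and `mlp_param_count`.

```python
def mlp_layer_shapes(d_in, n_hidden, n_features, n_layers=4):
    """Return the (fan_in, fan_out) pair of each linear layer.

    Assumed layout: d_in -> n_hidden -> ... -> n_hidden -> n_features,
    with n_layers linear layers in total.
    """
    widths = [d_in] + [n_hidden] * (n_layers - 1) + [n_features]
    return list(zip(widths[:-1], widths[1:]))


def mlp_param_count(d_in, n_hidden, n_features, n_layers=4):
    """Count learnable parameters of the assumed MLP layout."""
    shapes = mlp_layer_shapes(d_in, n_hidden, n_features, n_layers)
    total = 0
    for idx, (fan_in, fan_out) in enumerate(shapes):
        # Linear layer: weight matrix plus bias vector.
        total += fan_in * fan_out + fan_out
        # Assumption: batch norm (scale and shift) follows hidden layers only.
        if idx < len(shapes) - 1:
            total += 2 * fan_out
    return total


if __name__ == "__main__":
    # N_f = 512 is stated in the text; 2048 hidden units is the value
    # reported in the MLP ablation; 768 is a hypothetical input width.
    for fan_in, fan_out in mlp_layer_shapes(768, 2048, 512):
        print(f"linear {fan_in} -> {fan_out}")
    print("parameters:", mlp_param_count(768, 2048, 512))
```

Under these assumptions the adapter stays in the range of roughly ten million parameters, which is small next to a frozen ViT backbone; a different input width or normalisation placement would change the count.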

Impact of Auxiliary Losses

The L_div loss, as proposed by Norrenbrock et al. [41] and introduced in detail in Sec. 7, is analysed here. Fig. 11 illustrates the influence of L_div on accuracy and SID@5. Notably, the weight of this loss is strongly positively correlated with SID@5. Hence, the lightweight interpretability adapter can be steered si...

Impact of MLP Depth

Fig. 13 plots accuracy against the number of neurons in the MLP's hidden layers, N_hidden. Small accuracy gains are observed up to N_hidden = 2048, regardless of the number of features N_f, which is why we chose N_hidden = 2048 and N_f = 512, obtaining the best accuracy while keeping the model compact.

[Figure: SID@5 axis ...]

Visualisations

12.1. Class Comparisons

[Figure 14 labels: Hooded Oriole; 44, 0, 34, 36, 41, 1; Hooded Warbler; Features Hooded Oriole; Features Hooded Warbler]

Figure 14. Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for each class used to represent the Hooded Oriole and Hooded Warbler, completely without external ...

Detailed Results

Method Local. Features Accuracy↑ Faithful.↑ SID@5↑ Class-Indep.↑ Contrast.↑ CUB CARS CUB CARS CUB CARS CUB CARS
DINOv2 f_froz^CLS Linear Probe ✗ 87.9±0.1 91.7±0.1 42.6±0.2 50.9±0.2 51.5±0.1 99.2±0.0 99.1±0.0 59.2±0.0 60.9±0.0
Dense F_froz ✓ 78.1±0.3 92.9±0.1 32.7±0.2 91.8±0.7 93.1±0.1 98.8±0.0 98.7±0.0 84.5±0.3 82.8±0.1
Resnet50 Baseline [44] ✓ 83.9±0....

