
arXiv: 2604.07166 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.HC · cs.LG

DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

classification 💻 cs.CV · cs.HC · cs.LG
keywords DINOv2 · interpretability · image classification · adapter · sparsity loss · frozen backbone · quadratic programming · visual foundation models

The pith

DINO-QPM turns frozen DINOv2 patch embeddings into human-interpretable class-independent features while beating linear-probe accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DINO-QPM attaches a lightweight adapter based on the Quadratic Programming Enhanced Model to a frozen DINO backbone. It replaces the usual CLS token with average-pooled patch embeddings and adds a sparsity loss so that the resulting representations remain contrastive and spatially localizable. The adapter produces globally consistent explanations grounded in object parts rather than background. Experiments on standard benchmarks show higher classification accuracy than a DINOv2 linear probe together with better scores on a new Plausibility metric and other interpretability measures. The approach therefore brings QPM-style interpretability to visual foundation models without any backbone retraining.

Core claim

DINO-QPM adapts the Quadratic Programming Enhanced Model as a lightweight interpretability adapter for strictly frozen DINO backbones. By using average-pooling of patch embeddings instead of the CLS token and imposing a sparsity loss that minimizes spatial scatter, it converts entangled high-dimensional features into contrastive, class-independent representations that support human-plausible global explanations and direct spatial localization in the input image.

What carries the argument

QPM adapter applied to average-pooled patch embeddings from a frozen DINO backbone, regularized by a sparsity loss that reduces background noise.
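The two ingredients named here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the single linear projection with ReLU, the array sizes, and the mean-absolute-activation form of the sparsity term are all hypothetical stand-ins, and random arrays replace real frozen-backbone outputs.

```python
import numpy as np

def adapter_features(patch_embeddings, weight):
    """Project frozen patch embeddings to per-patch feature maps, then average-pool.

    patch_embeddings: (num_patches, embed_dim) array standing in for frozen backbone output.
    weight:           (embed_dim, num_features) hypothetical trainable projection.
    """
    # Assumed adapter: one linear layer followed by ReLU (non-negative activations).
    feature_maps = np.maximum(patch_embeddings @ weight, 0.0)
    # Average over patches instead of reading the CLS token.
    pooled = feature_maps.mean(axis=0)
    return feature_maps, pooled

def l1_feature_map_penalty(feature_maps):
    """Mean absolute activation over all patches and features (assumed sparsity term)."""
    return np.abs(feature_maps).mean()

rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 768))      # illustrative sizes: 16x16 patch grid, 768-dim embeddings
W = rng.normal(size=(768, 50)) * 0.01      # illustrative: 50 adapter features
feature_maps, pooled = adapter_features(patches, W)
penalty = l1_feature_map_penalty(feature_maps)
```

Because each pooled feature is just the mean of its own per-patch map, any feature value can be traced back to the patches that produced it, which is the property the spatial-localization argument relies on.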

If this is right

  • Classification accuracy exceeds that of a standard DINOv2 linear probe.
  • Explanations become globally consistent and spatially localizable because patch embeddings connect directly to input space.
  • A sparsity loss forces explanations to focus on relevant object parts instead of background noise.
  • The full interpretability level previously available only with QPM becomes usable as a plug-in adapter for any frozen visual foundation model.
  • The method outperforms other applicable techniques for frozen backbones on both accuracy and explanation-quality metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adapter pattern could be tested on other frozen foundation models such as CLIP or larger ViT variants to check generalization.
  • The sparsity-driven focus on object parts may reduce reliance on spurious correlations that plague many post-hoc explanation techniques.
  • Standardized use of a Plausibility metric could encourage consistent benchmarking of explanation quality across future interpretable-vision papers.
  • If the adapter preserves performance while adding interpretability, regulatory or safety-critical vision deployments could adopt it without retraining large backbones.

Load-bearing premise

Average-pooling patch embeddings plus a sparsity loss on a frozen backbone yields globally consistent human-plausible explanations without new fitting artifacts or loss of discriminative power.

What would settle it

If DINO-QPM classification accuracy falls below the DINOv2 linear probe on ImageNet or if its explanations fail to exceed competing methods on the introduced Plausibility metric, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2604.07166 by Bodo Rosenhahn, Robert Zimmermann, Thomas Norrenbrock.

Figure 1: Overview of our proposed DINO-QPM. The pipeline processes the (a) input image using the frozen backbone to produce patch …
Figure 2: Radar Plot demonstrating the quality of DINO-QPM …
Figure 4: Architecture of our proposed DINO-QPM. Patch embed…
Figure 5: Comparison of a Brewer’s Blackbird image with a Rusty Blackbird image. From the selected features …
Figure 6: Visualisation of the Plausibility metric on a sample from …
Figure 7: Impact of the number of MLP layers ($N_{\text{layers}}$) on classification accuracy. The plot compares the accuracy on CUB-2011 of using frozen patch-level feature maps ($F^{\text{froz}}$) versus the global feature vector ($f^{\text{froz}}_{\text{CLS}}$) for both Dense and QPM … [Adjacent plot residue: accuracy grid with axes "N Selected Features" and "N Features per Class".]
Figure 10: Qualitative ablation of $L_{\text{L1-FM}}$ on the Gray-crowned Rosy Finch. Without $L_{\text{L1-FM}}$ (top row), feature activations exhibit background noise and spatial scatter. Adding $L_{\text{L1-FM}}$ (bottom row) suppresses this noise, resulting in distinct activations semantically localised to specific object parts. [Spilled body text: … pactness, e.g. using our Compact Model in Tab. 1. A significant increase in model accuracy which highly correlates with …]
Figure 11: Accuracy and Feature Diversity (SID@5) on CUB-2011 across variations of the …
Figure 12: Impact of the $L_{\text{L1-FV}}$ on CUB-2011 accuracy during finetuning. [Adjacent plot residue: Accuracy versus $N_{\text{hidden}}$.]
Figure 13: Mean finetuning accuracy on CUB-2011 for various numbers of features …
Figure 14: Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for each …
Figure 15: Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for …
Figure 16: Faithful global interpretability on Stanford Cars: DINO-QPM autonomously discovers the 5 diverse, generalisable features …
Figure 17: Faithful global interpretability on Stanford Cars: DINO-QPM autonomously discovers the 5 diverse, generalisable features …
Figure 18: Comparison on the Orchard Oriole (CUB-2011). We show test samples where our DINO-QPM correctly classifies the image while the dense baseline fails. Columns from left to right: original image ($X$), GradCAM activation map of the dense model, the most similar training sample from the dense-predicted class alongside its cosine similarity score ($\text{sim} = \max_{s \in S_{c_{\text{pred}}}} \text{CosSim}(f^{\text{froz}}_{\text{CLS}}(X), f^{\text{froz}}_{\text{CLS}}(s))$), and the …
Figure 19: Comparison on the Louisiana Waterthrush (CUB-2011). We show test samples where our DINO-QPM correctly classifies the image while the dense baseline fails. Columns from left to right: original image ($X$), GradCAM activation map of the dense model, the most similar training sample from the dense-predicted class alongside its cosine similarity score ($\text{sim} = \max_{s \in S_{c_{\text{pred}}}} \text{CosSim}(f^{\text{froz}}_{\text{CLS}}(X), f^{\text{froz}}_{\text{CLS}}(s))$), a…
Figure 20: Comparison on the Summer Tanager (CUB-2011). We compare the dense baseline and DINO-QPM on eight test images, correctly classified by both models. Each sample is shown as a triplet: the original image ($X$), the GradCAM attribution of the dense model, and the local explanation of DINO-QPM for the true class $c_{\text{true}}$. The GradCAM attributions of the dense model frequently spread across the background or miss th…
Figure 21: Indigo Bunting samples from the CUB-2011 test set. We compare the dense baseline and DINO-QPM on eight test images, …
Figure 22: Failure analysis on the Green Kingfisher (CUB-2011). Comparison of test samples misclassified by both models. Columns (left to right): original image ($X$); dense model GradCAM; DINO-QPM local explanations for both the true and predicted classes; and the nearest training sample from the predicted class with its cosine similarity ($\text{sim} = \max_{s \in S_{c_{\text{pred}}}} \text{CosSim}(f^{\text{froz}}_{\text{CLS}}(X), f^{\text{froz}}_{\text{CLS}}(s))$). Although both models …
Figure 23: Failure analysis on the Least Flycatcher (CUB-2011). We show test samples where the dense baseline classifies correctly but DINO-QPM does not. Columns from left to right: original image ($X$), GradCAM attribution of the dense model, DINO-QPM local explanations for the true and predicted classes, and the most similar training sample from the predicted class with its cosine similarity score ($\text{sim} = \max_{s \in S_{c_{\text{pred}}}}$ …
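The similarity score quoted in the captions of Figures 18, 19, 22 and 23 is the maximum cosine similarity between the test image's frozen CLS embedding and the CLS embeddings of training samples from the predicted class. A minimal sketch of that lookup follows; the function name and the toy 2-D vectors are illustrative assumptions, and real embeddings would come from the frozen backbone.

```python
import numpy as np

def nearest_training_sample(query, class_samples):
    """Return (index, similarity) of the training embedding most cosine-similar to the query.

    query:         (dim,) embedding of the test image.
    class_samples: (n, dim) embeddings of training images from the predicted class.
    """
    # Normalise to unit length so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    s = class_samples / np.linalg.norm(class_samples, axis=1, keepdims=True)
    sims = s @ q
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy 2-D vectors for illustration only.
query = np.array([1.0, 0.0])
train = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
idx, sim = nearest_training_sample(query, train)
```

In the toy data the second training vector points in the same direction as the query, so it is selected regardless of its larger norm.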
Original abstract

Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes DINO-QPM, a lightweight adapter for strictly frozen DINOv2 backbones that replaces the CLS token with average-pooling over patch embeddings and adds a sparsity loss. This is claimed to convert entangled foundation-model features into contrastive, globally interpretable representations while exceeding the accuracy of a DINOv2 linear probe and improving explanation quality as measured by a newly introduced Plausibility metric together with other interpretability scores.

Significance. If the accuracy claim and the validity of the Plausibility metric are substantiated, the work would be significant for extending globally interpretable methods such as QPM to modern frozen visual foundation models without backbone retraining. The adapter design and explicit spatial-localization goal address a practical gap between high-performance feature extractors and human-plausible explanations.

major comments (3)
  1. [Abstract] Abstract: the headline claim that DINO-QPM exceeds DINOv2 linear-probe accuracy is stated without any numerical values, standard deviations, or references to specific tables or figures; the same paragraph introduces the Plausibility metric but supplies no definition, formula, or validation procedure against human judgments.
  2. [§3] §3 (Method): the central accuracy claim rests on the untested assumption that uniform average-pooling of patch embeddings plus a sparsity loss will preserve (or exceed) the discriminative power already encoded in the learned CLS token of a frozen DINOv2 backbone; no ablation isolating the pooling operator versus a CLS-based adapter is reported.
  3. [§4] §4 (Experiments): the sparsity-loss weight and QPM regularization parameters are listed as free hyperparameters; without an explicit statement that they were selected on a held-out validation split separate from the reported test sets, the reported superiority on both accuracy and interpretability metrics risks being inflated by post-hoc tuning.
minor comments (2)
  1. [§3] Notation for the QPM adapter layers and the exact form of the sparsity loss could be clarified with an equation block or pseudocode.
  2. [§4] Figure captions should explicitly state the backbone, dataset split, and whether the backbone is frozen when presenting qualitative explanation maps.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that DINO-QPM exceeds DINOv2 linear-probe accuracy is stated without any numerical values, standard deviations, or references to specific tables or figures; the same paragraph introduces the Plausibility metric but supplies no definition, formula, or validation procedure against human judgments.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript, we will include numerical accuracy improvements (with standard deviations) and direct references to the relevant tables and figures. We will also add a concise definition of the Plausibility metric together with a reference to its full formulation and evaluation procedure in the main text.
    Revision: yes

  2. Referee: [§3] §3 (Method): the central accuracy claim rests on the untested assumption that uniform average-pooling of patch embeddings plus a sparsity loss will preserve (or exceed) the discriminative power already encoded in the learned CLS token of a frozen DINOv2 backbone; no ablation isolating the pooling operator versus a CLS-based adapter is reported.

    Authors: The referee correctly identifies the lack of an explicit ablation isolating average-pooling from a CLS-token adapter. While our design is motivated by the requirement for spatial localization (which the CLS token cannot support), we acknowledge that a direct comparison would strengthen the evidence. We will add such an ablation study in the revised version.
    Revision: yes

  3. Referee: [§4] §4 (Experiments): the sparsity-loss weight and QPM regularization parameters are listed as free hyperparameters; without an explicit statement that they were selected on a held-out validation split separate from the reported test sets, the reported superiority on both accuracy and interpretability metrics risks being inflated by post-hoc tuning.

    Authors: The hyperparameters were selected via tuning on held-out validation splits that are disjoint from the reported test sets. We will make this procedure explicit in the revised Experiments section, including the specific validation protocol and selected values.
    Revision: yes
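The protocol at issue in the third exchange, choosing hyperparameters on a validation split and only then touching the test set, can be sketched as below. This is an illustration of the general practice, not the authors' procedure: the helper name is hypothetical and the accuracy values are placeholder numbers, not results from the paper.

```python
def select_on_validation(candidates, evaluate):
    """Pick the candidate with the best validation score; the test split is never consulted."""
    scored = [(evaluate(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score

# Placeholder validation accuracies per sparsity-loss weight (toy values, not from the paper).
val_accuracy = {0.0: 0.80, 0.1: 0.84, 1.0: 0.82}
best_weight, best_val = select_on_validation(val_accuracy.keys(), val_accuracy.get)
```

Only the selected weight would then be evaluated once on the test split, so the reported test number is not itself a product of the search.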

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external baselines and standard hyperparameter practice

Full rationale

The paper proposes an adapter method (average-pooling + sparsity loss on frozen DINOv2) and evaluates it empirically against DINOv2 linear probe and other methods using introduced metrics. No equations, derivations, or uniqueness theorems are presented that reduce to fitted parameters or self-citations by construction. Hyperparameter choices (e.g., sparsity weight) follow standard validation-set tuning and do not force the reported accuracy or plausibility gains. The central claims remain falsifiable via the external comparisons shown in the experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The adapter relies on the assumption that DINO patch embeddings contain spatially meaningful information that can be linearly combined after average pooling. No new entities are postulated. Hyperparameters of the sparsity loss and QPM solver are free parameters whose values are not reported in the abstract.

free parameters (2)
  • sparsity_loss_weight
    Controls the trade-off between classification accuracy and spatial focus; value not stated in abstract.
  • QPM_regularization_parameters
    Inherited from the original QPM formulation; tuning details absent.
axioms (1)
  • Domain assumption: Average pooling of patch embeddings preserves sufficient spatial information for global interpretability.
    Invoked when the paper states that average-pooling enables spatial localisation within the input space.
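One reason the average-pooling assumption is at least plausible is that pooling is linear, so a linear read-out on pooled features decomposes exactly into per-patch contributions. The notation below is assumed for illustration and is not taken from the paper: $F_{p,j}$ is feature $j$ at patch $p$, $P$ is the number of patches, $f_j$ is the pooled feature, and $w_{c,j}$ is the weight of feature $j$ for class $c$ in an assumed linear classification head.

```latex
f_j = \frac{1}{P} \sum_{p=1}^{P} F_{p,j},
\qquad
y_c = \sum_{j} w_{c,j}\, f_j
    = \frac{1}{P} \sum_{p=1}^{P} \sum_{j} w_{c,j}\, F_{p,j}.
```

The identity shows that each patch's share of a class logit is well defined under these assumptions; it does not by itself establish that the resulting maps are human-plausible, which is what the axiom additionally requires.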



Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages

  1. [1]

    https://www.allaboutbirds.org/guide/Rusty Blackbird/id

    Rusty Blackbird Identification, All About Birds, Cornell Lab of Ornithology. https://www.allaboutbirds.org/guide/Rusty Blackbird/id. 5

  2. [2]

    Quantifying Attention Flow in Transformers, 2020

    Samira Abnar and Willem Zuidema. Quantifying Attention Flow in Transformers, 2020. 3

  3. [3]

    Jaakkola

    David Alvarez-Melis and Tommi S. Jaakkola. Towards ro- bust interpretability with self-explaining neural networks. CoRR, abs/1806.07538, 2018. 2, 6

  4. [4]

    Birds look like cars: adversarial analysis of intrinsically interpretable deep learning.Machine Learning, 114(12), 2025

    Hubert Baniecki and Przemyslaw Biecek. Birds look like cars: adversarial analysis of intrinsically interpretable deep learning.Machine Learning, 114(12), 2025. 2

  5. [5]

    B-cos Net- works: Alignment is All We Need for Interpretability

    Moritz Bohle, Mario Fritz, and Bernt Schiele. B-cos Net- works: Alignment is All We Need for Interpretability. In 2022 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 10319–10328, New Or- leans, LA, USA, 2022. IEEE. 3

  6. [6]

    B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers, 2024

    Moritz B ¨ohle, Navdeeppal Singh, Mario Fritz, and Bernt Schiele. B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers, 2024. 3

  7. [7]

    Class-Discriminative Attention Maps for Vision Transform- ers, 2024

    Lennart Brocki, Jakub Binda, and Neo Christopher Chung. Class-Discriminative Attention Maps for Vision Transform- ers, 2024. 2, 3

  8. [8]

    Di- noV1: Emerging Properties in Self-Supervised Vision Trans- formers, 2021

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Di- noV1: Emerging Properties in Self-Supervised Vision Trans- formers, 2021. 2, 4, 5

  9. [9]

    Transformer Inter- pretability Beyond Attention Visualization, 2021

    Hila Chefer, Shir Gur, and Lior Wolf. Transformer Inter- pretability Beyond Attention Visualization, 2021. 2, 3

  10. [10]

    This Looks Like That: Deep Learning for Interpretable Image Recognition, 2019

    Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This Looks Like That: Deep Learning for Interpretable Image Recognition, 2019. 1

  11. [11]

    A Simple Framework for Contrastive Learn- ing of Visual Representations, 2020

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A Simple Framework for Contrastive Learn- ing of Visual Representations, 2020. 4, 7, 12

  12. [12]

    Context Autoencoder for Self- Supervised Representation Learning, 2023

    Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context Autoencoder for Self- Supervised Representation Learning, 2023. 2

  13. [13]

    Evaluating Visual Explanations of Atten- tion Maps for Transformer-based Medical Imaging

    Minjae Chung, Jong Bum Won, Ganghyun Kim, Yujin Kim, and Utku Ozbulak. Evaluating Visual Explanations of Atten- tion Maps for Transformer-based Medical Imaging. pages 110–120. 2025. 2

  14. [14]

    Learning to Esti- mate Shapley Values with Vision Transformers, 2023

    Ian Covert, Chanwoo Kim, and Su-In Lee. Learning to Esti- mate Shapley Values with Vision Transformers, 2023. 3

  15. [15]

    Surgical-dino: Adapter learning of foundation models for depth estimation in endoscopic surgery, 2024

    Beilei Cui, Mobarakol Islam, Long Bai, and Hongliang Ren. Surgical-dino: Adapter learning of foundation models for depth estimation in endoscopic surgery, 2024. 1, 4

  16. [16]

    Vision Transformers Need Registers, 2024

    Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers, 2024. 4, 5

  17. [17]

    The Road Less Scheduled, 2024

    Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Kon- stantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The Road Less Scheduled, 2024. 2

  18. [18]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021. 2

  19. [19]

    CUB-200-2011 Segmentations, 2022

    Ryan Farrell. CUB-200-2011 Segmentations, 2022. Seg- mentation masks for the CUB-200-2011 dataset. 6

  20. [20]

    Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation

    Patrick Glandorf and Bodo Rosenhahn. Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3006–3016, 2025. 3

  21. [21]

    Hy- persparse neural networks: Shifting exploration to exploita- tion through adaptive regularization

    Patrick Glandorf, Timo Kaiser, and Bodo Rosenhahn. Hy- persparse neural networks: Shifting exploration to exploita- tion through adaptive regularization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1234–1243, 2023. 3

  22. [22]

    Richemond, Elena Buchatskaya, Carl Do- ersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Moham- mad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, R´emi Munos, and Michal Valko

    Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Do- ersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Moham- mad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, R´emi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised Learning, 2020. 4

  23. [23]

    Gurobi optimizer reference manual, 2024

    Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2024. 2

  24. [24]

    Hadsell, S

    R. Hadsell, S. Chopra, and Y . LeCun. Dimensionality Re- duction by Learning an Invariant Mapping. In2006 IEEE Computer Society Conference on Computer Vision and Pat- tern Recognition (CVPR’06), pages 1735–1742, 2006. 4

  25. [25]

    Momentum Contrast for Unsupervised Visual Rep- resentation Learning, 2020

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Rep- resentation Learning, 2020. 4

  26. [26]

    does it? shortcomings of latent space prototype interpretability in deep networks

    Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that... does it? shortcomings of la- tent space prototype interpretability in deep networks.CoRR, abs/2105.02968, 2021. 2

  27. [27]

    Inman and Edwin L

    Henry F. Inman and Edwin L. Bradley Jr. The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two nor- mal densities.Communications in Statistics - Theory and Methods, 18(10):3851–3874, 1989. 6, 2

  28. [28]

    Optimal transport aggre- gation for visual place recognition, 2023

    Sergio Izquierdo and Javier Civera. Optimal transport aggre- gation for visual place recognition, 2023. 1, 4

  29. [29]

    Towards Faithfully Inter- pretable NLP Systems: How should we define and evaluate faithfulness?, 2020

    Alon Jacovi and Yoav Goldberg. Towards Faithfully Inter- pretable NLP Systems: How should we define and evaluate faithfulness?, 2020. 6

  30. [30]

    Sarthak Jain and Byron C. Wallace. Attention is not Expla- nation, 2019. 2

  31. [31]

    Uncertainsam: Fast and efficient uncertainty quantification of the segment anything model

    Timo Kaiser, Thomas Norrenbrock, and Bodo Rosenhahn. Uncertainsam: Fast and efficient uncertainty quantification of the segment anything model. InForty-second Interna- tional Conference on Machine Learning. 1

  32. [32]

    Explainability of Vision Transformers: A Comprehensive Review and New Perspectives, 2023

    Rojina Kashefi, Leili Barekatain, Mohammad Sabokrou, and Fatemeh Aghaeipoor. Explainability of Vision Transformers: A Comprehensive Review and New Perspectives, 2023. 3

  33. [33]

    Sunnie S. Y . Kim, Nicole Meister, Vikram V . Ramaswamy, Ruth Fong, and Olga Russakovsky. Hive: Evaluating the human interpretability of visual explanations, 2022. 2

  34. [34]

    3D object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th IEEE Workshop on 3D Representation and Recognition, at ICCV 2013 (3dRR-13), Sydney, Australia, 2013. 5

  35. [35]

    Contrastive explanation.Royal Institute of Phi- losophy Supplement, 27:247–266, 1990

    Peter Lipton. Contrastive explanation.Royal Institute of Phi- losophy Supplement, 27:247–266, 1990. 2, 6

  36. [36]

    Data or Language Supervision: What Makes CLIP Better than DINO?, 2025

    Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, and Serena Yeung-Levy. Data or Language Supervision: What Makes CLIP Better than DINO?, 2025. 8

  37. [37]

    Interpretable Image Classification with Adaptive Prototype-based Vision Trans- formers

    Chiyu Ma, Jon Donnelly, Wenjun Liu, Soroush V osoughi, Cynthia Rudin, and Chaofan Chen. Interpretable Image Classification with Adaptive Prototype-based Vision Trans- formers. 2024. 1, 3

  38. [38]

    A Closer Look at Benchmarking Self-Supervised Pre- training with Image Classification, 2024

    Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, and Pietro Per- ona. A Closer Look at Benchmarking Self-Supervised Pre- training with Image Classification, 2024. 8

  39. [39]

    Explanation in artificial intelligence: Insights from the social sciences.Artificial Intelligence, 267:1–38,

    Tim Miller. Explanation in artificial intelligence: Insights from the social sciences.Artificial Intelligence, 267:1–38,

  40. [40]

    Neural Prototype Trees for Interpretable Fine-grained Image Recog- nition, 2021

    Meike Nauta, Ron van Bree, and Christin Seifert. Neural Prototype Trees for Interpretable Fine-grained Image Recog- nition, 2021. 1

  41. [41]

    Take 5: Interpretable Image Classification with a Handful of Features, 2023

    Thomas Norrenbrock, Marco Rudolph, and Bodo Rosen- hahn. Take 5: Interpretable Image Classification with a Handful of Features, 2023. 2, 3, 5, 7, 1, 12

  42. [42]

    Q-SENN: Quantized Self-Explaining Neural Net- works.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21482–21491, 2024

    Thomas Norrenbrock, Marco Rudolph, and Bodo Rosen- hahn. Q-SENN: Quantized Self-Explaining Neural Net- works.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21482–21491, 2024. 7, 1, 12

  43. [43]

    CHiQPM: Calibrated Hierarchical Interpretable Image Clas- sification, 2025

    Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Nesli- han Kose, Ramesh Manuvinakurike, and Bodo Rosenhahn. CHiQPM: Calibrated Hierarchical Interpretable Image Clas- sification, 2025. 2

  44. [44]

    QPM: Discrete Op- timization for Globally Interpretable Image Classification,

    Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Ramesh Manuvinakurike, and Bodo Rosenhahn. QPM: Discrete Op- timization for Globally Interpretable Image Classification,

  45. [45]

    2, 3, 4, 5, 6, 7, 8, 1, 12

  46. [46]

    Nguyen, and Tsui- Wei Weng

    Tuomas Oikarinen, Subhro Das, Lam M. Nguyen, and Tsui- Wei Weng. Label-Free Concept Bottleneck Models, 2023. 2, 3

  47. [47]

    DINOv2: Learning Robust Visual Features without Supervision, 2024

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...

  48. [48]

    IA-REDˆ2: Interpretability-Aware Redundancy Reduction for Vision Transformers

    Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-REDˆ2: Interpretability-Aware Redundancy Reduction for Vision Transformers. InAdvances in Neural Information Process- ing Systems, pages 24898–24911. Curran Associates, Inc.,

  49. [49]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Rai- son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-per...

  50. [50]

    DCBM: Data-Efficient Visual Concept Bottleneck Models, 2025

    Katharina Prasse, Patrick Knab, Sascha Marton, Christian Bartelt, and Margret Keuper. DCBM: Data-Efficient Visual Concept Bottleneck Models, 2025. 3

  51. [51]

    Read and Amy Marcus-Newhall

    Stephen J. Read and Amy Marcus-Newhall. Explanatory co- herence in social explanations: A parallel distributed pro- cessing account.Journal of Personality and Social Psychol- ogy, 65(3):429–447, 1993. 2, 6

  52. [52]

    Principles of categorization

    Eleanor Rosch. Principles of categorization. InCognition and categorization, pages 27–48. Lawrence Erlbaum Asso- ciates, Hillsdale, NJ, 1978. 1

  53. [53]

    Optimization of sparsity-constrained neural networks as a mixed integer linear program

    Bodo Rosenhahn. Optimization of sparsity-constrained neural networks as a mixed integer linear program. Journal of Optimization Theory and Applications, 199(3):931–954,

  54. [54]

    Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

    Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215,

  55. [55]

    ProtoPShare: Prototype Sharing for Interpretable Image Classification and Similarity Discovery

    Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, and Bartosz Zieliński. ProtoPShare: Prototype Sharing for Interpretable Image Classification and Similarity Discovery. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1420–1430, 2021. 1

  56. [56]

    COSEWIC Assessment and Status Report on the Rusty Blackbird, Euphagus Carolinus, in Canada

    Carl Savignac. COSEWIC Assessment and Status Report on the Rusty Blackbird, Euphagus Carolinus, in Canada. Committee on the Status of Endangered Wildlife in Canada, Ottawa, 2006. 5

  57. [57]

    FaceNet: A Unified Embedding for Face Recognition and Clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015. 4

  58. [58]

    Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2):336–359, 2020. 6

  59. [59]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

  60. [60]

    Concept Bottleneck Large Language Models,

    Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept Bottleneck Large Language Models,

  61. [61]

    ProtoS-ViT: Visual foundation models for sparse self-explainable classifications, 2024

    Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, and Christian Lovis. ProtoS-ViT: Visual foundation models for sparse self-explainable classifications, 2024. 1

  62. [62]

    Tell me why: Visual foundation models as self-explainable classifiers, 2025

    Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, and Christian Lovis. Tell me why: Visual foundation models as self-explainable classifiers, 2025. 1

  63. [63]

    Representation Learning with Contrastive Predictive Coding, 2019

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding, 2019. 4

  64. [64]

    Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection

    Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 595–604,

  65. [65]

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology,

  66. [66]

    Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification, 2023

    Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification, 2023. 2

  67. [67]

    Post-hoc Concept Bottleneck Models, 2023

    Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc Concept Bottleneck Models, 2023. 2, 3

  68. [68]

    Top-down neural attention by excitation backprop

    Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 543–559. Springer, 2016. 2, 6

  69. [69]

    A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence, 2023

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence, 2023. 1, 4

  70. [70]

    Partially Shared Concept Bottleneck Models, 2025

    Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, and Jun Yu. Partially Shared Concept Bottleneck Models, 2025. 3

  71. [71]

    Interpretable Image Classification via Non-parametric Part Prototype Learning, 2025

    Zhijie Zhu, Lei Fan, Maurice Pagnucco, and Yang Song. Interpretable Image Classification via Non-parametric Part Prototype Learning, 2025. 1, 3

    DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification
    Supplementary Material

    Feature Diversity Loss

    To reduce conceptual ambiguity between features, Norrenbrock et al. [41] introduced the Feature Diversity Loss, hereafter referred to as $L_{\mathrm{div}}$. The objective of $L_{\mathrm{div}}$ is to encourage the representation of distinct, mutually independent concepts within the features, thereby enhancing the degree of model interpretability. Let i ∈ I = {...

    Definition of Additional Interpretability Metrics

    To assess model interpretability, we apply several metrics following Norrenbrock et al. [41, 42, 44]. Since interpretability is multifaceted, multiple metrics addressing distinct concepts are necessary. Throughout this section, we utilise the following notation for index sets: $i \in I = \{1, \dots, W_f\}$ an...

    Implementation Details

    All input images are resized to 224×224 pixels and normalised according to the dataset mean values. Unless otherwise specified, the Multi-Layer Perceptron (MLP) consists of four layers featuring ReLU activation and batch normalisation. The number of features is set to $N_f = 512$, and the number of neurons in the hidden layers is Nhidde...
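    The shapes implied by these settings can be traced with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the patch size of 14, the embedding width of 768, the random stand-in for the frozen backbone output, the linear final layer, and the order "MLP per patch, then average pooling" are all assumptions, and the helper names (`init_mlp`, `batch_norm`, `mlp_forward`) are hypothetical. Only the 224×224 input, the four layers with ReLU and batch normalisation, and $N_f = 512$ come from the text; the hidden width of 2048 is the value reported in the MLP ablation.

```python
import numpy as np

IMG_SIZE = 224     # input resolution stated in the text
PATCH_SIZE = 14    # assumption: DINOv2 ViT with 14x14 patches
EMBED_DIM = 768    # assumption: ViT-B embedding width, not stated in this section
N_HIDDEN = 2048    # hidden width reported in the MLP ablation
N_F = 512          # number of features stated in the text


def init_mlp(rng, dims):
    """Random weights and zero biases for consecutive layer sizes in `dims`."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def batch_norm(x, eps=1e-5):
    """Per-channel normalisation over the first axis, without learned scale or shift."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def mlp_forward(x, params):
    """Hidden layers: linear -> batch norm -> ReLU. Final layer: linear only (assumption)."""
    for weight, bias in params[:-1]:
        x = np.maximum(batch_norm(x @ weight + bias), 0.0)
    weight, bias = params[-1]
    return x @ weight + bias


rng = np.random.default_rng(0)
n_patches = (IMG_SIZE // PATCH_SIZE) ** 2                    # 16 * 16 patches
patch_embeddings = rng.normal(size=(n_patches, EMBED_DIM))   # stand-in for frozen backbone output

params = init_mlp(rng, [EMBED_DIM, N_HIDDEN, N_HIDDEN, N_HIDDEN, N_F])  # four linear layers
feature_maps = mlp_forward(patch_embeddings, params)  # one activation per patch and feature
features = feature_maps.mean(axis=0)                  # average pooling over patches
```

    Keeping the per-patch `feature_maps` before pooling is what would allow each of the 512 features to be localised in the input image; the pooled `features` vector is what a classifier head would consume.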

    Impact of Auxiliary Losses

    The $L_{\mathrm{div}}$ loss, as proposed by Norrenbrock et al. [41] and introduced in detail in Sec. 7, is analysed here. Fig. 11 illustrates the influence of $L_{\mathrm{div}}$ on accuracy and SID@5. Notably, the weight of this loss correlates strongly and positively with SID@5. Hence, the lightweight interpretability adapter can be steered si...

    Impact of MLP Depth

    Fig. 13 illustrates the accuracy plotted against the number of neurons in the MLP's hidden layers, $N_{\mathrm{hidden}}$. Small accuracy gains are observed up to $N_{\mathrm{hidden}} = 2048$, regardless of the number of features $N_f$, which is why we chose $N_{\mathrm{hidden}} = 2048$ and $N_f = 512$, obtaining optimal accuracy while minimising compactness.

    [Figure: SID@5 axis, ticks 50 to 90 ...]
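    To make the size side of this trade-off concrete, the following sketch counts weights and biases of a four-layer MLP for several hidden widths. It is a back-of-the-envelope illustration only: the input width of 768 and the grid of candidate widths are assumptions, batch-normalisation parameters are ignored, and `mlp_param_count` is a hypothetical helper, not a function from the paper.

```python
def mlp_param_count(d_in, n_hidden, n_features, n_layers=4):
    """Weights plus biases of an MLP with `n_layers` linear layers (batch norm ignored)."""
    dims = [d_in] + [n_hidden] * (n_layers - 1) + [n_features]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))


EMBED_DIM = 768   # assumption: backbone embedding width, not stated in this section
N_F = 512         # number of features chosen in the text

# Hypothetical grid of hidden widths; 2048 is the value chosen in the text.
counts = {n_hidden: mlp_param_count(EMBED_DIM, n_hidden, N_F)
          for n_hidden in (256, 512, 1024, 2048, 4096)}

for n_hidden, count in counts.items():
    print(f"N_hidden={n_hidden:5d}  parameters={count:,}")
```

    Because the two hidden-to-hidden layers scale quadratically with the hidden width, doubling it beyond 2048 roughly quadruples their share of the parameters, which is the cost that any further accuracy gain would have to justify.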

    Visualisations

    12.1. Class Comparisons

    [Figure 14 panels: Hooded Oriole; Hooded Warbler; Features Hooded Oriole; Features Hooded Warbler. Labels: 44, 0, 34, 36, 41, 1]

    Figure 14. Faithful global interpretability on CUB-2011: DINO-QPM autonomously discovers the 5 diverse, generalisable features for each class used to represent the Hooded Oriole and Hooded Warbler, completely without external ...

    Detailed Results

    Columns: Method, Local. Features, Accuracy↑, Faithful.↑, SID@5↑, Class-Indep.↑, Contrast.↑ (sub-columns CUB and CARS)

    DINOv2 ffroz CLS Linear Probe: ✗, 87.9±0.1, 91.7±0.1, 42.6±0.2, 50.9±0.2, 51.5±0.1, 99.2±0.0, 99.1±0.0, 59.2±0.0, 60.9±0.0
    Dense Ffroz: ✓, 78.1±0.3, 92.9±0.1, 32.7±0.2, 91.8±0.7, 93.1±0.1, 98.8±0.0, 98.7±0.0, 84.5±0.3, 82.8±0.1
    Resnet50 Baseline [44]: ✓, 83.9±0....