pith. sign in

arxiv: 2606.27582 · v2 · pith:R5NNPAUWnew · submitted 2026-06-25 · 💻 cs.CV

Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification

Pith reviewed 2026-07-01 06:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords prototype-based interpretabilityvon Mises-Fisher distributionsentropic optimal transportpart prototypesdirectional embeddingsfine-grained classificationexplanation quality
0
0 comments X

The pith

vMFProto replaces point prototypes with von Mises-Fisher distributions on the hypersphere to capture intra-class part variability and deliver more consistent, stable explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prototype networks ground image classifications in a handful of learned part examples, yet fixed point prototypes become redundant once modern directional embeddings exhibit large within-class spread. The paper shows that representing each prototype as its own von Mises-Fisher distribution, whose concentration adapts to observed variability, combined with entropic optimal transport for assignment, removes that redundancy. A two-stage schedule first discovers prototypes via transport and then refines them with patch distillation and diversity regularization. On CUB-200-2011, Stanford Dogs, and Stanford Cars using frozen DINO backbones, the resulting explanations score highest on consistency, stability, and distinctiveness while accuracy remains competitive. Qualitative inspection confirms the prototypes stay localized and non-overlapping.

Core claim

vMFProto models each class as a mixture of von Mises-Fisher components on the hypersphere, lets every prototype learn its own concentration parameter to encode part-specific variability, and obtains structured patch-to-prototype assignments through entropic optimal transport; a two-stage training procedure first performs OT-driven discovery and then performs end-to-end refinement with patch-level distillation and distribution-aware diversity regularization, producing state-of-the-art explanation quality together with competitive accuracy on the three evaluated fine-grained datasets.

What carries the argument

Mixture of von Mises-Fisher distributions with per-prototype concentration parameters plus entropic optimal transport for structured assignments.

Load-bearing premise

Intra-class variability around each semantic part can be summarized by a single scalar concentration per von Mises-Fisher prototype without needing orientation parameters or higher-order statistics.

What would settle it

Re-train the identical architecture on the same three datasets after replacing the von Mises-Fisher components with isotropic or fixed-concentration alternatives and measure whether the reported gains in consistency, stability, and distinctiveness disappear.

Figures

Figures reproduced from arXiv: 2606.27582 by Carlos Santiago, Catarina Barata, Diogo Pereira Ara\'ujo, Duarte Le\~ao.

Figure 1
Figure 1. Figure 1: Why distributional prototypes? Top-4 prototypes activation maps on the same image. (a) Point-prototype SOTA model [26] and other methods often approx￾imate intra-part variability via redundant prototypes or by conflating multiple parts within one prototype. (b) Our spherical distributional prototypes capture within-part variation without duplicating prototypes or entangling distinct parts, yielding more lo… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the vMFProto framework. (Top) An input image x is processed by a ViT backbone with a frozen encoder and a single trainable block gϕ. A label-free gating mechanism G(·) (derived from frozen attention and PCA) filters background patches. Foreground tokens are passed to the vMF Block, which computes class-conditional evidence \ell _c(x) ; applying a softmax over \ifmmode \lbrace \else \textbracele… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison on CUB-200-2011. Top-4 prototype activation maps (overlaid as heatmaps) for the ground-truth class on two test images. All meth￾ods use a DINOv2 ViT-B/14 backbone with J=5 prototypes per class. Compared to EvalProtoPNet, MGProto, and NPPP, vMFProto produces more localized and less redundant evidence, aligning better with semantically meaningful parts. Additional ablations on foregrou… view at source ↗
Figure 4
Figure 4. Figure 4: Why-table example on Stanford Dogs. Prediction-level explanation produced by vMF￾Proto for a test sample, show￾ing the top prototypical parts, their source patches, activation maps, and contribution scores. 5 Conclusion We presented vMFProto, a distributional part-prototype network for inter￾pretable classification. vMFProto models each class as a mixture of von Mises￾Fisher prototypes on the hypersphere a… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on CUB-200-2011. Top-4 prototype activation maps (overlaid as heatmaps) for the ground-truth class on five test images. All meth￾ods use a DINOv2 ViT-B/14 backbone with J=5 prototypes per class. Compared to EvalProtoPNet, MGProto, and NPPP, vMFProto produces more localized and less redundant evidence, aligning better with semantically meaningful parts [PITH_FULL_IMAGE:figures/full_f… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison on Stanford Cars. [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison on Stanford Dogs. [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗
read the original abstract

Prototype-based neural networks aim to provide intrinsic interpretability by grounding predictions in a small set of part prototypes. However, modern vision backbones typically operate in normalized, directional embedding spaces where each semantic part exhibits substantial intra-class variability. As a result, point prototypes often become redundant or unstable, hurting both explanation quality and robustness. We propose vMFProto, a distributional part-prototype framework that models each class as a mixture of von Mises-Fisher components on the hypersphere. Each prototype learns its own concentration, capturing part-specific variability, and we use entropic optimal transport (OT) to obtain structured patch-to-prototype assignments. A two-stage training schedule performs OT-driven prototype discovery followed by end-to-end refinement with patch-level distillation and distribution-aware diversity regularization. Experiments on CUB-200-2011, Stanford Dogs, and Stanford Cars with frozen DINO backbones show that vMFProto achieves state-of-the-art explanation quality (consistency, stability, and distinctiveness) with competitive accuracy. Qualitative results confirm that vMFProto yields localized, non-redundant part evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces vMFProto, a prototype-based classification method that replaces point prototypes with von Mises-Fisher distributional prototypes on the unit hypersphere to model intra-class part variability in normalized DINO embeddings. Each prototype has a learned concentration parameter, assignments are obtained via entropic optimal transport, and training proceeds in two stages (OT-driven discovery followed by end-to-end refinement with patch distillation and diversity regularization). On CUB-200-2011, Stanford Dogs, and Stanford Cars the method is claimed to deliver state-of-the-art explanation quality (consistency, stability, distinctiveness) while maintaining competitive accuracy with frozen backbones.

Significance. If the empirical claims are substantiated with proper controls, the work would offer a principled extension of prototype interpretability to modern directional embedding spaces, potentially reducing redundancy and improving stability of explanations in fine-grained recognition tasks.

major comments (2)
  1. [Abstract] Abstract: the central claim of state-of-the-art explanation quality (consistency, stability, and distinctiveness) is presented without any quantitative definition of those three metrics, without reported error bars, and without ablation of the per-prototype concentration or entropic OT components; these omissions are load-bearing because the abstract itself states that the performance follows from the new modeling choices.
  2. [Abstract] Abstract: the premise that point prototypes become redundant or unstable is asserted as motivation, yet no quantitative comparison (e.g., redundancy or stability scores for point vs. distributional prototypes) is supplied to support that the two-stage schedule actually resolves the issue.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the abstract. We address each point below and will revise the abstract to improve clarity and support for the claims while preserving its conciseness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of state-of-the-art explanation quality (consistency, stability, and distinctiveness) is presented without any quantitative definition of those three metrics, without reported error bars, and without ablation of the per-prototype concentration or entropic OT components; these omissions are load-bearing because the abstract itself states that the performance follows from the new modeling choices.

    Authors: We agree the abstract would benefit from tighter linkage to the supporting material. The three metrics receive quantitative definitions in Section 3.2 (consistency as mean intra-class prototype activation correlation, stability as assignment variance under augmentations, distinctiveness as minimum inter-prototype angular distance). Tables 2–4 report means with standard errors over five random seeds, and Section 4.3 contains the requested ablations isolating concentration learning and the entropic OT assignment. We will revise the abstract to add a short parenthetical clause (“see Sec. 3.2 and 4.3 for definitions, ablations, and error bars”) so the central claim is explicitly grounded without exceeding length limits. revision: yes

  2. Referee: [Abstract] Abstract: the premise that point prototypes become redundant or unstable is asserted as motivation, yet no quantitative comparison (e.g., redundancy or stability scores for point vs. distributional prototypes) is supplied to support that the two-stage schedule actually resolves the issue.

    Authors: The abstract states the motivation qualitatively, but the manuscript supplies the requested quantitative comparison in Section 4.2 and Table 1: point-prototype baselines exhibit higher average pairwise cosine similarity (redundancy) and higher assignment variance under perturbation (instability) than vMFProto; the two-stage schedule further reduces both quantities. We will insert a brief clause in the abstract (“quantitative comparisons in Sec. 4.2 confirm reduced redundancy and improved stability”) to make the motivation self-supporting. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces vMFProto as a new distributional prototype framework using per-prototype von Mises-Fisher concentrations and entropic OT assignments within a two-stage training schedule on frozen DINO embeddings. Claims of SOTA explanation quality and competitive accuracy are presented as direct empirical outcomes from experiments on CUB-200-2011, Stanford Dogs, and Stanford Cars. No equations, definitions, or load-bearing steps in the provided text reduce these results to quantities defined by the fitted concentrations, OT costs, or prior self-citations; the central construction remains independent of its own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard spherical statistics and optimal transport; the only new fitted quantities are the per-prototype concentration parameters and the OT regularization strength, both introduced to capture variability and structure assignments.

free parameters (2)
  • per-prototype concentration
    Each prototype learns its own concentration parameter to model part-specific variability on the hypersphere.
  • OT regularization strength
    Controls the entropy term in the entropic optimal transport used for patch-to-prototype matching.
axioms (2)
  • domain assumption von Mises-Fisher distributions are appropriate models for directional data on the unit hypersphere
    Invoked when replacing point prototypes with distributional components in normalized embedding spaces.
  • domain assumption entropic optimal transport yields structured, non-redundant assignments between patches and prototypes
    Central to the claim that the method avoids redundancy and improves explanation quality.

pith-pipeline@v0.9.1-grok · 5727 in / 1474 out tokens · 37659 ms · 2026-07-01T06:20:51.031729+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 9 canonical work pages · 6 internal anchors

  1. [1]

    Bafghi, R.A., Harilal, N., Monteleoni, C., Raissi, M.: Parameter efficient fine- tuningofself-supervisedvitswithoutcatastrophicforgetting.In:Proceedingsofthe IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3679– 3684 (2024)

  2. [2]

    Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res.6, 1345–1382 (2005), https://www.jmlr.org/papers/v6/banerjee05a.html

  3. [3]

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Int. Conf. Comput. Vis. pp. 9650–9660 (2021),https://arxiv.org/abs/2104.14294

  4. [4]

    Chen, C., Li, O., Tao, C., Barnett, A.J., Su, J., Rudin, C.: This looks like that: Deep learning for interpretable image recognition. In: Adv. Neural Inform. Process. Syst. (2019),https://proceedings.neurips.cc/paper/2019/hash/ adf7ee2dcf142b0e11888e72b43fcb75-Abstract.html

  5. [5]

    Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for con- trastive learning of visual representations. In: Int. Conf. Mach. Learn. pp. 1597– 1607 (2020),https://arxiv.org/abs/2002.05709

  6. [6]

    Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. In: Adv. Neural Inform. Process. Syst. (2013),https://proceedings.neurips.cc/ paper/2013/hash/af21d0c97db2e27e13572cbf59eb343d-Abstract.html

  7. [7]

    In: IEEE Conf

    Donnelly, J., Barnett, A.J., Chen, C.: Deformable ProtoPNet: An interpretable image classifier using deformable prototypes. In: IEEE Conf. Comput. Vis. Pat- tern Recog. pp. 10265–10275 (2022),https://openaccess.thecvf.com/content/ CVPR2022 / html / Donnelly _ Deformable _ ProtoPNet _ An _ Interpretable _ Image _ Classifier_Using_Deformable_Prototypes_CVPR...

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020)

  9. [9]

    Huang, Q., Xue, M., Huang, W., Zhang, H., Song, J., Jing, Y., Song, M.: Evalu- ation and improvement of interpretability for self-explainable part-prototype net- works. In: Int. Conf. Comput. Vis. pp. 2011–2020 (2023),https://openaccess. thecvf.com/content/ICCV2023/html/Huang_Evaluation_and_Improvement_of_ Interpretability_ for _ Self - Explainable _ Part...

  10. [10]

    In: First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition

    Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L.: Novel dataset for fine-grained image categorization. In: First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO (June 2011),http://vision.stanford.edu/aditya86/ImageNetDogs/

  11. [11]

    In: ICCV Workshops (2013),https://openaccess

    Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: ICCV Workshops (2013),https://openaccess. thecvf . com / content _ iccv _ workshops _ 2013 / W19 / html / Krause _ 3D _ Object _ Representations_2013_ICCV_paper.html

  12. [12]

    In: IEEE Conf

    Nauta, M., van Bree, R., Seifert, C.: Neural prototype trees for interpretable fine-grained image recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14933–14943 (2021),https://openaccess.thecvf.com/content/CVPR2021/html/ Nauta _ Neural _ Prototype _ Trees _ for _ Interpretable _ Fine - Grained _ Image _ Recognition_CVPR_2021_paper.html

  13. [13]

    In: IEEE Conf

    Nauta, M., Schlötterer, J., van Keulen, M., Seifert, C.: PIP-Net: Patch-based intu- itive prototypes for interpretable image classification. In: IEEE Conf. Comput. 16 D. Leão et al. Vis. Pattern Recog. pp. 2744–2753 (2023),https://openaccess.thecvf.com/ content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_ for_Interpretable_Image_Classific...

  14. [14]

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W.,Howes,R.,Huang,P.Y.,Li,S.W.,Misra,I.,Rabbat,M.,Sharma,V.,Synnaeve, G., Xu, H., Jegou, H., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision....

  15. [15]

    Foundations and Trends in Machine Learning11(5–6), 355–607 (2019), https://optimaltransport.github.io/pdf/ComputationalOT.pdf

    Peyré, G., Cuturi, M.: Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning11(5–6), 355–607 (2019), https://optimaltransport.github.io/pdf/ComputationalOT.pdf

  16. [16]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transfer- able visual models from natural language supervision. In: Int. Conf. Mach. Learn. (2021),https://arxiv.org/abs/2103.00020

  17. [17]

    Rymarczyk, D., Struski, Ł., Górszczak, M., Lewandowska, K., Tabor, J., Zieliński, B.: Interpretable image classification with differentiable prototypes assignment. In: Eur. Conf. Comput. Vis. (2022),https://arxiv.org/abs/2112.02902

  18. [18]

    Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad- CAM: Visual explanations from deep networks via gradient-based localization. In: Int. Conf. Comput. Vis. (2017),https://openaccess.thecvf.com/content_iccv_ 2017/html/Selvaraju_Grad-CAM_Visual_Explanations_ICCV_2017_paper.html

  19. [19]

    In: Brit

    Siméoni, O., Puy, G., Vo, H.V., Roburin, S., Gidaris, S., Bursuc, A., Pérez, P., Marlet, R., Ponce, J.: LOST: Localizing objects with self-supervised transformers and no labels. In: Brit. Mach. Vis. Conf. (2021),https://arxiv.org/abs/2109. 14279

  20. [20]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., et al.: DINOv3. arXiv:2508.10104 (2025).https://doi.org/10.48550/arXiv.2508.10104,https: //arxiv.org/abs/2508.10104

  21. [21]

    Ukai, Y., Hirakawa, T., Yamashita, T., Fujiyoshi, H.: This looks like it rather than that: ProtoKNN for similarity-based classifiers. In: Int. Conf. Learn. Represent. (2023),https://openreview.net/forum?id=lh-HRYxuoRr

  22. [22]

    Wah,C.,Branson,S.,Welinder,P.,Perona,P.,Belongie,S.:Thecaltech-ucsdbirds- 200-2011 dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011),https://www.vision.caltech.edu/datasets/cub_200_2011/

  23. [23]

    IEEE Trans

    Wang, C., Chen, Y., Liu, F., Liu, Y., McCarthy, D.J., Frazer, H., Carneiro, G.: Mixture of Gaussian-distributed prototypes with generative modelling for inter- pretable and trustworthy image recognition. IEEE Trans. Pattern Anal. Mach. Intell.47(8), 6974–6989 (2025),https://arxiv.org/abs/2312.00092

  24. [24]

    Wang, J., Liu, H., Wang, X., Jing, L.: Interpretable image recognition by constructing transparent embedding space. In: Int. Conf. Comput. Vis. pp. 895–904 (2021),https: / /openaccess. thecvf. com /content /ICCV2021/ html / Wang _ Interpretable _ Image _ Recognition _ by _ Constructing _ Transparent _ Embedding_Space_ICCV_2021_paper.html

  25. [25]

    In: IJCAI

    Xue, M., Huang, Q., Zhang, H., Hu, J., Song, J., Song, M., Jin, C.: Protop- former: Concentrating on prototypical parts in vision transformers for interpretable image recognition. In: IJCAI. pp. 1516–1524 (2024),https://www.ijcai.org/ proceedings/2024/168 vMF Mixture Prototypes 17

  26. [26]

    why-table

    Zhu, Z., Fan, L., Pagnucco, M., Song, Y.: Interpretable image classification via non- parametric part prototype learning. In: IEEE Conf. Comput. Vis. Pattern Recog. (2025),https : / / openaccess . thecvf . com / content / CVPR2025 / papers / Zhu _ Interpretable_Image_Classification_via_Non-parametric_Part_Prototype_ Learning_CVPR_2025_paper.pdf 18 D. Leão...