arxiv: 2605.02007 · v1 · submitted 2026-05-03 · 💻 cs.LG · cs.CV

How Can One Choose the Best CAM-Based Explainability Method for a CNN Model?

Pith reviewed 2026-05-09 17:24 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords CAM · explainability · saliency maps · human perception · CNN · distance metrics · crowdsourcing · RBO

The pith

Manhattan and correlation distances can identify which CAM explainability methods produce saliency maps that best match human perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether distance metrics other than intersection-over-union (IoU) can better identify class activation mapping (CAM) methods whose explanations align with how people actually see them. The researchers generated saliency maps for chihuahua images using several CAM variants, measured the distance between each map and a human-annotated bounding box using several distance metrics, and separately collected crowdsourced human choices of the best saliency map for each image. They then compared the method ranking produced by each distance metric against the crowdsourced ranking using rank-biased overlap (RBO). Manhattan and correlation distances produced rankings closest to the human votes, and LayerCAM, Score-CAM, and IS-CAM ranked highest for human resemblance.
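
The two kinds of ranking described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that methods are ordered by their mean distance over images and that the human ranking is a simple vote count, neither of which is confirmed by the text here. The method names and all numbers are invented placeholders.

```python
from collections import Counter
from statistics import mean


def rank_by_mean_distance(distances):
    """Order methods from smallest to largest mean distance (closest first).

    Assumption: per-image distances are aggregated with a plain mean.
    """
    return sorted(distances, key=lambda method: mean(distances[method]))


def rank_by_votes(votes):
    """Order methods from most to fewest crowdsourced votes."""
    return [method for method, _ in Counter(votes).most_common()]


# Toy values, invented for illustration only (not results from the paper).
distances = {
    "method_a": [0.30, 0.25, 0.35],
    "method_b": [0.10, 0.20, 0.15],
    "method_c": [0.50, 0.45, 0.40],
}
votes = ["method_b", "method_a", "method_b", "method_c", "method_a", "method_b"]

metric_ranking = rank_by_mean_distance(distances)
human_ranking = rank_by_votes(votes)
print(metric_ranking, human_ranking)
```

A metric is then judged by how similar its ranking is to the human ranking; the paper uses RBO for that comparison.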

Core claim

The results indicate the feasibility of our method to find the explainability method that best resembles human perception. In our experiments, the two metrics that best resemble human perception corresponded to Manhattan and Correlation. Besides, the best explainability methods regarding human perception were LayerCAM, Score-CAM, and IS-CAM.

What carries the argument

Ranking of CAM methods by distance between their saliency maps and human bounding boxes, validated by how closely those rankings match crowdsourced human preference rankings measured with rank-biased overlap.
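
To make the rank-biased overlap comparison concrete, here is a minimal sketch of the extrapolated RBO of Webber, Moffat, and Zobel for two rankings of equal length. The function name `rbo_ext` and the persistence value `p = 0.9` are assumptions for illustration; the value of `p` used in the paper is not given here.

```python
def rbo_ext(s, t, p=0.9):
    """Extrapolated RBO for two equal-length rankings without duplicates.

    p in (0, 1) controls how strongly the top ranks dominate the score.
    """
    if len(s) != len(t) or not s:
        raise ValueError("this sketch needs two non-empty rankings of equal length")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")

    k = len(s)
    seen_s, seen_t = set(), set()
    overlap = 0          # X_d: size of the intersection of the depth-d prefixes
    weighted_sum = 0.0   # running sum of (X_d / d) * p**d
    for d in range(1, k + 1):
        a, b = s[d - 1], t[d - 1]
        if a == b:
            overlap += 1
        else:
            # a joins the intersection if t already listed it, and vice versa.
            overlap += (a in seen_t) + (b in seen_s)
        seen_s.add(a)
        seen_t.add(b)
        weighted_sum += (overlap / d) * p ** d

    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum


identical = rbo_ext(["A", "B", "C"], ["A", "B", "C"])
reversed_order = rbo_ext(["A", "B", "C"], ["C", "B", "A"])
disjoint = rbo_ext(["A", "B"], ["C", "D"])
print(identical, reversed_order, disjoint)
```

Identical rankings score 1 and disjoint rankings score 0. Because the weights decay geometrically with depth, disagreements near the top of the ranking cost more than disagreements near the bottom.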

If this is right

  • Manhattan distance produces rankings of explainability methods that align most closely with human votes.
  • Correlation distance produces rankings that also align closely with human votes.
  • LayerCAM, Score-CAM, and IS-CAM generate the saliency maps that humans judge as best.
  • The evaluation procedure can feasibly guide selection of an explainability method for a CNN.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same comparison pipeline could be applied to other image datasets to check whether Manhattan and correlation remain the best distances.
  • Domains that require user trust in model decisions could adopt this human-perception filter before deploying an explainer.
  • Using eye-tracking records instead of bounding boxes might give a more precise ground truth for future comparisons.

Load-bearing premise

Numerical distances between a fixed human box and a variable-shaped saliency map actually reflect how people judge the quality of an explanation.

What would settle it

A fresh crowdsourcing round on the same images that produces a different top distance metric or different top CAM methods would show the current selection does not reliably capture human perception.

Figures

Figures reproduced from arXiv: 2605.02007 by Adriana C. F. Alvim, Daniel da Silva Costa, Pedro Nuno de Souza Moura.

Figure 1. From the top-left corner: firstly, the preprocessed image n02085620
Figure 2. Examples of the final annotation heatmap for three images:
Figure 3. The preprocessed image n02085620_5542 on the top-left and the overlapped explanation heatmap of each explainability method.
    d(u, v) = \sum_i |u_i - v_i|    (6)
    The Correlation distance between u and v is defined as:
    d(u, v) = 1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\|u - \bar{u}\|_2 \, \|v - \bar{v}\|_2}    (7)
    The Cosine distance between u and v is defined as:
    d(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}    (8)
    The Euclidean distance between u and v is defined as:…
Figure 4. Selected explainability methods in the validation experiment for
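
The distance definitions (6) to (8) excerpted with Figure 3 translate directly into code. The sketch below uses only the Python standard library and treats `u` and `v` as flattened vectors of equal length; the function names and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import math


def manhattan(u, v):
    """Eq. (6): sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    """Eq. (8): one minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def correlation(u, v):
    """Eq. (7): cosine distance between the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


# Toy vectors, invented for illustration: a flattened box mask and saliency map.
box = [0.0, 1.0, 1.0, 0.0]
saliency = [0.1, 0.8, 0.6, 0.0]

print(manhattan(box, saliency), correlation(box, saliency), cosine(box, saliency))
```

The same three quantities are available as `cityblock`, `correlation`, and `cosine` in `scipy.spatial.distance`. Note that correlation distance is unchanged by shifting or positively rescaling either vector, whereas Manhattan distance is sensitive to both.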

original abstract

In recent years, several advances have been observed in Deep Learning with surprising results. Models in this area have been increasingly used in numerous applications, including those sensitive to human life, which require clear explanations and justifications. Various explainability methods have been proposed, but not many metrics to evaluate these methods. The most commonly used metric is the Intersection over Union (IoU). However, due to the characteristics of the results of the explainability methods, called saliency maps, which do not have a known shape, we hypothesise that there must be a better metric that allows one to find an explainability method that produces results that best resemble the human perception. We propose using different metrics to assess the similarity between human perception and the explanation saliency maps to find a better metric. An investigation was conducted employing a subset of the Chihuahuas images from ImageNet dataset. Several CAM-based explainability methods were used to generate saliency maps for each chihuahua image. Alignment was measured by applying distance metrics between the bounding box of human annotations and the saliency maps produced by each explainability method. Rankings of the best saliency maps were created using the results of the distance metrics and compared to the ranking obtained using people's choice, collected through crowdsourcing, of the best explanation saliency maps for each selected image. Comparison between rankings was performed using the Rank-Biased Overlap (RBO) metric. The results indicate the feasibility of our method to find the explainability method that best resembles human perception. In our experiments, the two metrics that best resemble human perception corresponded to Manhattan and Correlation. Besides, the best explainability methods regarding human perception were LayerCAM, Score-CAM, and IS-CAM.
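
For contrast with the distance metrics, IoU needs a region with a definite shape, so a continuous saliency map has to be binarised first. The sketch below shows that step under an assumed threshold; the function name, the threshold values, and the toy data are hypothetical and not taken from the paper.

```python
def iou(saliency, box_mask, threshold=0.5):
    """IoU between a thresholded saliency map and a binary box mask.

    Both inputs are flat sequences of equal length. Pixels with saliency
    at or above `threshold` count as the predicted region.
    """
    predicted = [value >= threshold for value in saliency]
    intersection = sum(1 for p, m in zip(predicted, box_mask) if p and m)
    union = sum(1 for p, m in zip(predicted, box_mask) if p or m)
    # Convention chosen for this sketch: two empty regions give IoU 0.
    return intersection / union if union else 0.0


# Toy values, invented for illustration only.
box_mask = [0, 1, 1, 0]
saliency = [0.1, 0.8, 0.6, 0.0]

iou_at_half = iou(saliency, box_mask, threshold=0.5)
iou_at_07 = iou(saliency, box_mask, threshold=0.7)
print(iou_at_half, iou_at_07)
```

The score moves from 1.0 to 0.5 when only the threshold changes, which illustrates why a threshold-free distance on the continuous map is an attractive alternative.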

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes using distance metrics between human-annotated bounding boxes and continuous saliency maps generated by various CAM explainability methods to produce rankings of those methods; these rankings are then compared via Rank-Biased Overlap (RBO) to rankings derived from crowdsourced human votes on which saliency map best explains each image. The central claim is that this procedure identifies metrics (Manhattan and Correlation) and CAM variants (LayerCAM, Score-CAM, IS-CAM) that best align with human perception of explanation quality.

Significance. If the proxy relationship holds, the work supplies a practical, annotation-based procedure for selecting CAM methods without repeated crowdsourcing, addressing a recognized gap in the quantitative evaluation of post-hoc explanations. The independent crowdsourced ground truth and the use of RBO to measure ranking agreement are constructive elements that could be extended to other explanation families.

major comments (3)
  1. [Methods] Methods section: the procedure for computing distance metrics between binary bounding-box masks and continuous-valued saliency maps is not specified (normalization, any implicit thresholding, resolution alignment, or treatment of saliency mass outside the box). Because all subsequent rankings and RBO comparisons rest on these distances, the omission directly affects the reproducibility and validity of the reported superiority of Manhattan and Correlation.
  2. [Results] Results section: RBO scores comparing distance-based rankings to human-vote rankings are presented without error bars, participant agreement statistics, or significance tests. This leaves unclear whether the observed advantages for Manhattan and Correlation are statistically distinguishable from other metrics or merely within sampling variability.
  3. [Discussion] Discussion section: the manuscript does not test or discuss whether proximity to the full rectangular bounding box is a faithful proxy for human judgments of explanation quality. Humans may systematically prefer saliency maps that concentrate on class-discriminative sub-regions rather than uniform coverage of the entire annotated object; if so, high RBO would indicate only that the metric reproduces a particular ranking, not that it captures resemblance to human perception of good explanations.
minor comments (2)
  1. [Abstract] Abstract and Methods: the size of the Chihuahuas subset, selection criteria, and number of crowdsourcing participants are not reported, impeding assessment of statistical power and reproducibility.
  2. [Figures] Figures: inclusion of qualitative examples showing saliency maps overlaid with the human bounding boxes would clarify how the distance metrics behave on typical outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and positive assessment of the work's potential. Below, we provide point-by-point responses to the major comments, committing to revisions where appropriate to enhance clarity, reproducibility, and discussion of limitations.

point-by-point responses
  1. Referee: [Methods] Methods section: the procedure for computing distance metrics between binary bounding-box masks and continuous-valued saliency maps is not specified (normalization, any implicit thresholding, resolution alignment, or treatment of saliency mass outside the box). Because all subsequent rankings and RBO comparisons rest on these distances, the omission directly affects the reproducibility and validity of the reported superiority of Manhattan and Correlation.

    Authors: We fully agree that the Methods section requires more explicit details to ensure reproducibility. In the revised manuscript, we will add a dedicated paragraph or subsection describing the exact procedure: saliency maps are first normalized to the [0, 1] interval, resized to the same spatial dimensions as the input image and bounding boxes using appropriate interpolation, no thresholding is performed to preserve the continuous nature of the maps, and the distance metrics are applied directly to the full maps, thereby penalizing any saliency mass outside the bounding box regions. revision: yes

  2. Referee: [Results] Results section: RBO scores comparing distance-based rankings to human-vote rankings are presented without error bars, participant agreement statistics, or significance tests. This leaves unclear whether the observed advantages for Manhattan and Correlation are statistically distinguishable from other metrics or merely within sampling variability.

    Authors: We recognize the importance of statistical validation for the reported RBO differences. In the revision, we will include inter-participant agreement metrics from the crowdsourcing study (e.g., average pairwise RBO or Cohen's kappa where applicable). Additionally, we will discuss the robustness of the rankings and, if data permits, provide bootstrap-based estimates of variability in RBO scores by resampling the image set. We note that the experiment was designed with a fixed set of images, limiting full statistical testing without new experiments, but these additions will clarify the reliability of the findings. revision: partial

  3. Referee: [Discussion] Discussion section: the manuscript does not test or discuss whether proximity to the full rectangular bounding box is a faithful proxy for human judgments of explanation quality. Humans may systematically prefer saliency maps that concentrate on class-discriminative sub-regions rather than uniform coverage of the entire annotated object; if so, high RBO would indicate only that the metric reproduces a particular ranking, not that it captures resemblance to human perception of good explanations.

    Authors: We thank the referee for highlighting this potential limitation in our proxy measure. We will expand the Discussion section to address this explicitly, acknowledging that bounding boxes provide a coarse proxy for object location and that human preferences might favor saliency focused on discriminative features within the object. Our crowdsourcing directly captures human judgments of explanation quality, and the high RBO with Manhattan and Correlation metrics indicates alignment under this setup. We will clarify that our claim is about consistency with human selections when using bounding-box proximity as the similarity measure, and propose extensions using pixel-level annotations for future validation. revision: partial
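
The procedure outlined in the first response (normalise to [0, 1], no thresholding, distance over the full map) can be sketched as below. This is one possible reading under stated assumptions: min-max normalisation, a saliency map already at image resolution (the resizing step is omitted), inclusive box coordinates, and Manhattan distance as the example metric. All function names and values are hypothetical.

```python
def min_max_normalise(saliency):
    """Rescale a 2-D map (list of rows) to the [0, 1] interval."""
    flat = [value for row in saliency for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant map carries no spatial information; map it to zeros.
        return [[0.0 for _ in row] for row in saliency]
    return [[(value - low) / (high - low) for value in row] for row in saliency]


def box_mask(height, width, box):
    """Binary mask that is 1 inside box = (x_min, y_min, x_max, y_max), inclusive."""
    x_min, y_min, x_max, y_max = box
    return [
        [1.0 if x_min <= x <= x_max and y_min <= y <= y_max else 0.0
         for x in range(width)]
        for y in range(height)
    ]


def manhattan_to_box(saliency, box):
    """Manhattan distance between the normalised map and the box mask."""
    norm = min_max_normalise(saliency)
    mask = box_mask(len(norm), len(norm[0]), box)
    return sum(
        abs(s - m)
        for s_row, m_row in zip(norm, mask)
        for s, m in zip(s_row, m_row)
    )


# Toy 3x3 maps, invented for illustration only.
box = (1, 0, 2, 1)  # columns 1..2, rows 0..1
inside = [[0, 2, 0], [0, 4, 2], [0, 0, 0]]   # saliency mass inside the box
outside = [[4, 0, 0], [0, 0, 0], [0, 0, 0]]  # saliency mass outside the box

print(manhattan_to_box(inside, box), manhattan_to_box(outside, box))
```

Because the distance is taken over every pixel, saliency outside the box and missing saliency inside it are both penalised, which is the behaviour the response describes.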

Circularity Check

0 steps flagged

No circularity: purely empirical ranking comparison against independent human votes

full rationale

The paper performs an experimental comparison: CAM methods generate saliency maps on Chihuahua images; several distance metrics are computed between those maps and human bounding-box annotations; the resulting per-metric rankings are then evaluated against a separate crowdsourced ranking of human preference via RBO. No fitted parameters or derivations appear in the described pipeline; the only equations are the standard definitions of the distance metrics. The human bounding boxes and preference votes constitute external data collected independently of the distance metrics, so the central claim (that Manhattan and Correlation best match human perception) is a direct empirical finding rather than a self-referential or fitted result. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two untested domain assumptions: that crowdsourced human votes reliably proxy 'human perception' of explanation quality, and that bounding-box overlap with a saliency map is a valid proxy for that perception. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Crowdsourced human choices on which saliency map is best constitute a stable and representative ground truth for human perception of explanation quality.
    This ground truth is used to validate which distance metric and which CAM method best resemble human perception.
  • domain assumption Distance between a human bounding box and a saliency map is a meaningful measure of alignment with human perception.
    The paper applies this to create rankings that are then compared to human votes.
