How Can One Choose the Best CAM-Based Explainability Method for a CNN Model?
Pith reviewed 2026-05-09 17:24 UTC · model grok-4.3
The pith
Manhattan and correlation distances can identify which CAM explainability methods produce saliency maps that best match human perception.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The results indicate the feasibility of our method to find the explainability method that best resembles human perception. In our experiments, the two metrics that best resemble human perception corresponded to Manhattan and Correlation. Besides, the best explainability methods regarding human perception were LayerCAM, Score-CAM, and IS-CAM.
What carries the argument
Ranking of CAM methods by distance between their saliency maps and human bounding boxes, validated by how closely those rankings match crowdsourced human preference rankings measured with rank-biased overlap.
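The ranking-and-comparison step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the extrapolated form of Rank-Biased Overlap for two equal-length rankings without ties, a persistence parameter p = 0.9 (the page does not state which RBO variant or p value the paper uses), and invented distance values. The function names `rank_by_distance` and `rbo_ext` are hypothetical.

```python
def rank_by_distance(mean_distance):
    """Order method names from smallest to largest mean distance (best first)."""
    return sorted(mean_distance, key=lambda name: (mean_distance[name], name))


def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated RBO for two equal-length rankings with no ties or duplicates.

    RBO_ext = (X_k / k) * p**k + ((1 - p) / p) * sum_{d=1..k} (X_d / d) * p**d,
    where X_d is the number of items shared by the two depth-d prefixes.
    """
    if len(ranking_a) != len(ranking_b) or not ranking_a:
        raise ValueError("this sketch assumes non-empty rankings of equal length")
    k = len(ranking_a)
    seen_a, seen_b = set(), set()
    overlap = 0          # X_d, updated incrementally as the depth grows
    weighted_sum = 0.0
    for depth in range(1, k + 1):
        a, b = ranking_a[depth - 1], ranking_b[depth - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if the other ranking has already listed it.
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        weighted_sum += (overlap / depth) * p ** depth
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum


# Invented toy values, for illustration only (not results from the paper).
toy_mean_distance = {"LayerCAM": 0.21, "Score-CAM": 0.24, "IS-CAM": 0.27, "Grad-CAM": 0.40}
toy_human_ranking = ["Score-CAM", "LayerCAM", "IS-CAM", "Grad-CAM"]

metric_ranking = rank_by_distance(toy_mean_distance)
agreement = rbo_ext(metric_ranking, toy_human_ranking)
```

Because RBO weights the top of the list most heavily, swapping the first two methods costs more than swapping the last two, which suits a setting where only the best few explainers matter.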
If this is right
- Manhattan distance produces rankings of explainability methods that align most closely with human votes.
- Correlation distance produces rankings that also align closely with human votes.
- LayerCAM, Score-CAM, and IS-CAM generate the saliency maps that humans judge as best.
- The evaluation procedure can feasibly guide selection of an explainability method for a CNN.
Where Pith is reading between the lines
- The same comparison pipeline could be applied to other image datasets to check whether Manhattan and correlation remain the best distances.
- Domains that require user trust in model decisions could adopt this human-perception filter before deploying an explainer.
- Using eye-tracking records instead of bounding boxes might give a more precise ground truth for future comparisons.
Load-bearing premise
Numerical distances between a fixed human box and a variable-shaped saliency map actually reflect how people judge the quality of an explanation.
What would settle it
A fresh crowdsourcing round on the same images that produces a different top distance metric or different top CAM methods would show the current selection does not reliably capture human perception.
Original abstract
In recent years, several advances have been observed in Deep Learning with surprising results. Models in this area have been increasingly used in numerous applications, including those sensitive to human life, which require clear explanations and justifications. Various explainability methods have been proposed, but not many metrics to evaluate these methods. The most commonly used metric is the Intersection over Union (IoU). However, due to the characteristics of the results of the explainability methods, called saliency maps, which do not have a known shape, we hypothesise that there must be a better metric that allows one to find an explainability method that produces results that best resemble the human perception. We propose using different metrics to assess the similarity between human perception and the explanation saliency maps to find a better metric. An investigation was conducted employing a subset of the Chihuahuas images from ImageNet dataset. Several CAM-based explainability methods were used to generate saliency maps for each chihuahua image. Alignment was measured by applying distance metrics between the bounding box of human annotations and the saliency maps produced by each explainability method. Rankings of the best saliency maps were created using the results of the distance metrics and compared to the ranking obtained using people's choice, collected through crowdsourcing, of the best explanation saliency maps for each selected image. Comparison between rankings was performed using the Rank-Biased Overlap (RBO) metric. The results indicate the feasibility of our method to find the explainability method that best resembles human perception. In our experiments, the two metrics that best resemble human perception corresponded to Manhattan and Correlation. Besides, the best explainability methods regarding human perception were LayerCAM, Score-CAM, and IS-CAM.
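The abstract's motivation, that saliency maps have no fixed shape and so fit IoU poorly, can be illustrated with a small sketch. IoU needs a binary region, so a continuous map must first be thresholded, and the score then depends on the threshold chosen. The 4x4 map, the box coordinates, and the helper names `box_mask` and `iou_at_threshold` below are all invented for illustration and do not come from the paper.

```python
def box_mask(height, width, box):
    """Binary mask for an axis-aligned box given as (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


def iou_at_threshold(saliency, mask, threshold):
    """IoU between the box mask and the saliency map binarized at `threshold`."""
    intersection = union = 0
    for s_row, m_row in zip(saliency, mask):
        for s, m in zip(s_row, m_row):
            active = s >= threshold
            intersection += int(active and m == 1)
            union += int(active or m == 1)
    return intersection / union if union else 0.0


# Invented toy saliency map, already scaled to [0, 1].
toy_saliency = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 0.6, 0.1],
    [0.1, 0.6, 0.4, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
toy_mask = box_mask(4, 4, (1, 1, 3, 3))  # central 2x2 box

# The same map yields different IoU scores depending on the cut-off.
iou_by_threshold = {t: iou_at_threshold(toy_saliency, toy_mask, t) for t in (0.05, 0.3, 0.5)}
```

On this toy map the score moves from about 0.33 to 1.0 to 0.75 as the threshold goes from 0.05 to 0.3 to 0.5, which is the kind of sensitivity that distance metrics on the unthresholded map avoid.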
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes using distance metrics between human-annotated bounding boxes and continuous saliency maps generated by various CAM explainability methods to produce rankings of those methods; these rankings are then compared via Rank-Biased Overlap (RBO) to rankings derived from crowdsourced human votes on which saliency map best explains each image. The central claim is that this procedure identifies metrics (Manhattan and Correlation) and CAM variants (LayerCAM, Score-CAM, IS-CAM) that best align with human perception of explanation quality.
Significance. If the proxy relationship holds, the work supplies a practical, annotation-based procedure for selecting CAM methods without repeated crowdsourcing, addressing a recognized gap in quantitative evaluation of post-hoc explanations. The use of independent crowdsourced ground truth and RBO for ranking agreement are constructive elements that could be extended to other explanation families.
major comments (3)
- [Methods] Methods section: the procedure for computing distance metrics between binary bounding-box masks and continuous-valued saliency maps is not specified (normalization, any implicit thresholding, resolution alignment, or treatment of saliency mass outside the box). Because all subsequent rankings and RBO comparisons rest on these distances, the omission directly affects the reproducibility and validity of the reported superiority of Manhattan and Correlation.
- [Results] Results section: RBO scores comparing distance-based rankings to human-vote rankings are presented without error bars, participant agreement statistics, or significance tests. This leaves unclear whether the observed advantages for Manhattan and Correlation are statistically distinguishable from other metrics or merely within sampling variability.
- [Discussion] Discussion section: the manuscript does not test or discuss whether proximity to the full rectangular bounding box is a faithful proxy for human judgments of explanation quality. Humans may systematically prefer saliency maps that concentrate on class-discriminative sub-regions rather than uniform coverage of the entire annotated object; if so, high RBO would indicate only that the metric reproduces a particular ranking, not that it captures resemblance to human perception of good explanations.
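The concern in the third major comment can be made concrete with a toy case. The sketch below uses invented 4x4 maps (not data from the paper) and a plain Manhattan distance against a binary box mask, assuming both are at the same resolution. It shows only how the metric behaves; it says nothing about which map a human rater would actually prefer.

```python
# Toy 4x4 example; all values are invented for illustration.
BOX_MASK = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Map A covers the whole box uniformly.
UNIFORM_MAP = [[float(v) for v in row] for row in BOX_MASK]

# Map B concentrates on a single pixel inside the box,
# standing in for a class-discriminative sub-region.
FOCUSED_MAP = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]


def manhattan_to_mask(saliency, mask):
    """Sum of absolute per-pixel differences between a saliency map and a mask."""
    return sum(
        abs(s - m)
        for s_row, m_row in zip(saliency, mask)
        for s, m in zip(s_row, m_row)
    )


uniform_distance = manhattan_to_mask(UNIFORM_MAP, BOX_MASK)  # 0.0
focused_distance = manhattan_to_mask(FOCUSED_MAP, BOX_MASK)  # 3.0
```

Under this metric the focused map is penalized for every box pixel it leaves dark, so a box-filling map always ranks first regardless of how informative the focused one is.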
minor comments (2)
- [Abstract] Abstract and Methods: the size of the Chihuahuas subset, selection criteria, and number of crowdsourcing participants are not reported, impeding assessment of statistical power and reproducibility.
- [Figures] Figures: inclusion of qualitative examples showing saliency maps overlaid with the human bounding boxes would clarify how the distance metrics behave on typical outputs.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and positive assessment of the work's potential. Below, we provide point-by-point responses to the major comments, committing to revisions where appropriate to enhance clarity, reproducibility, and discussion of limitations.
-
Referee: [Methods] Methods section: the procedure for computing distance metrics between binary bounding-box masks and continuous-valued saliency maps is not specified (normalization, any implicit thresholding, resolution alignment, or treatment of saliency mass outside the box). Because all subsequent rankings and RBO comparisons rest on these distances, the omission directly affects the reproducibility and validity of the reported superiority of Manhattan and Correlation.
Authors: We fully agree that the Methods section needs more explicit detail to ensure reproducibility. In the revised manuscript, we will add a dedicated paragraph or subsection describing the exact procedure: saliency maps are first normalized to the [0, 1] interval and resized to the spatial dimensions of the input image and bounding boxes using appropriate interpolation; no thresholding is performed, so that the continuous nature of the maps is preserved; and the distance metrics are applied directly to the full maps, thereby penalizing any saliency mass outside the bounding-box regions.
Revision: yes
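The procedure outlined in this response might look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' implementation: min-max scaling is assumed for the [0, 1] normalization, the resizing step is omitted (the map and mask are assumed to share a resolution already), maps are flattened to lists, and correlation distance is taken to be one minus the Pearson correlation. All function names and numeric values are hypothetical.

```python
import math


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def manhattan_distance(u, v):
    """L1 distance: sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def correlation_distance(u, v):
    """One minus the Pearson correlation of two equal-length vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mean_u) ** 2 for a in u)) * math.sqrt(
        sum((b - mean_v) ** 2 for b in v)
    )
    if norm == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / norm


# Flattened 4x4 binary mask of a central 2x2 bounding box (invented example).
mask = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]

# Invented raw activations: one map confined to the box, one leaking outside it.
raw_confined = [0, 0, 0, 0, 0, 5, 5, 0, 0, 5, 5, 0, 0, 0, 0, 0]
raw_leaky = [0, 4, 4, 0, 4, 10, 10, 4, 4, 10, 10, 4, 0, 4, 4, 0]

confined = min_max_normalize(raw_confined)
leaky = min_max_normalize(raw_leaky)

# No thresholding: distances are taken on the full continuous maps,
# so saliency outside the box increases the distance.
confined_l1 = manhattan_distance(confined, mask)
leaky_l1 = manhattan_distance(leaky, mask)
leaky_corr = correlation_distance(leaky, mask)
```

In this toy case the confined map has zero Manhattan distance to the mask, while the leaky map accumulates 0.4 for each of the eight pixels it activates outside the box.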
-
Referee: [Results] Results section: RBO scores comparing distance-based rankings to human-vote rankings are presented without error bars, participant agreement statistics, or significance tests. This leaves unclear whether the observed advantages for Manhattan and Correlation are statistically distinguishable from other metrics or merely within sampling variability.
Authors: We recognize the importance of statistical validation for the reported RBO differences. In the revision, we will include inter-participant agreement metrics from the crowdsourcing study (e.g., average pairwise RBO, or Cohen's kappa where applicable). Additionally, we will discuss the robustness of the rankings and, if the data permit, provide bootstrap-based estimates of the variability in RBO scores by resampling the image set. We note that the experiment was designed with a fixed set of images, which limits full statistical testing without new experiments, but these additions will clarify the reliability of the findings.
Revision: partial
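A bootstrap over the image set, as proposed in this response, could be sketched as follows. Everything here is an assumption for illustration: the per-image records are invented, each image is given a single winning method to stand in for the crowdsourced votes, rankings are aggregated by mean distance and by vote count, and the interval is a simple percentile interval. The compact `rbo_ext` is the extrapolated RBO for equal-length rankings with p = 0.9, which may differ from the variant used in the paper.

```python
import random

METHODS = ["LayerCAM", "Score-CAM", "IS-CAM", "Grad-CAM"]

# Invented per-image records: distance of each method's map to the box,
# plus the method the (hypothetical) voters picked for that image.
IMAGES = [
    {"dist": {"LayerCAM": 0.20, "Score-CAM": 0.25, "IS-CAM": 0.30, "Grad-CAM": 0.45}, "vote": "LayerCAM"},
    {"dist": {"LayerCAM": 0.22, "Score-CAM": 0.21, "IS-CAM": 0.33, "Grad-CAM": 0.40}, "vote": "Score-CAM"},
    {"dist": {"LayerCAM": 0.18, "Score-CAM": 0.27, "IS-CAM": 0.26, "Grad-CAM": 0.50}, "vote": "LayerCAM"},
    {"dist": {"LayerCAM": 0.30, "Score-CAM": 0.24, "IS-CAM": 0.28, "Grad-CAM": 0.35}, "vote": "IS-CAM"},
    {"dist": {"LayerCAM": 0.25, "Score-CAM": 0.23, "IS-CAM": 0.31, "Grad-CAM": 0.42}, "vote": "Score-CAM"},
    {"dist": {"LayerCAM": 0.19, "Score-CAM": 0.29, "IS-CAM": 0.27, "Grad-CAM": 0.48}, "vote": "LayerCAM"},
]


def rbo_ext(a, b, p=0.9):
    """Extrapolated RBO for two equal-length rankings of the same items."""
    k = len(a)
    agree = [len(set(a[:d]) & set(b[:d])) / d for d in range(1, k + 1)]
    return agree[-1] * p ** k + (1 - p) / p * sum(
        x * p ** d for d, x in enumerate(agree, start=1)
    )


def metric_ranking(images):
    """Rank methods by mean distance over the given images (smallest first)."""
    mean = {m: sum(img["dist"][m] for img in images) / len(images) for m in METHODS}
    return sorted(METHODS, key=lambda m: (mean[m], m))


def human_ranking(images):
    """Rank methods by number of votes (most first); ties broken by name."""
    votes = {m: 0 for m in METHODS}
    for img in images:
        votes[img["vote"]] += 1
    return sorted(METHODS, key=lambda m: (-votes[m], m))


def bootstrap_rbo(images, n_resamples=1000, seed=0, p=0.9):
    """Percentile interval (2.5%, 97.5%) of RBO under resampling of images."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        sample = [rng.choice(images) for _ in images]  # draw with replacement
        scores.append(rbo_ext(metric_ranking(sample), human_ranking(sample), p))
    scores.sort()
    low = scores[int(0.025 * n_resamples)]
    high = scores[min(n_resamples - 1, int(0.975 * n_resamples))]
    return low, high


point_estimate = rbo_ext(metric_ranking(IMAGES), human_ranking(IMAGES))
low, high = bootstrap_rbo(IMAGES)
```

Running the same procedure once per distance metric would show whether the intervals for Manhattan and Correlation actually separate from those of the other metrics, which is the question the referee raises.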
-
Referee: [Discussion] Discussion section: the manuscript does not test or discuss whether proximity to the full rectangular bounding box is a faithful proxy for human judgments of explanation quality. Humans may systematically prefer saliency maps that concentrate on class-discriminative sub-regions rather than uniform coverage of the entire annotated object; if so, high RBO would indicate only that the metric reproduces a particular ranking, not that it captures resemblance to human perception of good explanations.
Authors: We thank the referee for highlighting this potential limitation of our proxy measure. We will expand the Discussion section to address it explicitly, acknowledging that bounding boxes are only a coarse proxy for object location and that human preferences might favor saliency focused on discriminative features within the object. Our crowdsourcing directly captures human judgments of explanation quality, and the high RBO obtained with the Manhattan and Correlation metrics indicates alignment under this setup. We will clarify that our claim concerns consistency with human selections when bounding-box proximity is used as the similarity measure, and we will propose extensions using pixel-level annotations for future validation.
Revision: partial
Circularity Check
No circularity: purely empirical ranking comparison against independent human votes
The paper performs an experimental comparison: CAM methods generate saliency maps on Chihuahua images; several distance metrics are computed between those maps and human bounding-box annotations; the resulting per-metric rankings are then evaluated against a separate crowdsourced ranking of human preference via RBO. No equations, fitted parameters, or derivations appear in the described pipeline. The human bounding boxes and preference votes constitute external data collected independently of the distance metrics, so the central claim (that Manhattan and Correlation best match human perception) is a direct empirical finding rather than a self-referential or fitted result. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Crowdsourced human choices on which saliency map is best constitute a stable and representative ground truth for human perception of explanation quality.
- Domain assumption: Distance between a human bounding box and a saliency map is a meaningful measure of alignment with human perception.