
arxiv: 2606.22901 · v1 · pith:DBBBDLRBnew · submitted 2026-06-22 · 📡 eess.AS · cs.AI · eess.SP

Explainable AI in Speaker Recognition -- Attention Map Visualisation and Evaluation

Pith reviewed 2026-06-26 07:32 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · eess.SP
keywords explainable AI · speaker recognition · attention maps · GradCAM · LayerCAM · RISE-eval · neural network visualization

The pith

Modified RISE-eval shows GradCAM and LayerCAM each have distinct advantages for visualizing attention in speaker recognition networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the challenge of understanding how neural networks make decisions in speaker recognition by focusing on attention map visualization. It reviews an existing evaluation method for these maps, identifies its limitations, and proposes a modified version called Modified RISE-eval to fix them. The new method is then used to compare two common visualization approaches, GradCAM and LayerCAM, on a speaker recognition model. Results indicate that the two methods perform differently depending on the experimental setup, suggesting neither is universally superior. A reader might care because reliable explanations of AI decisions could improve trust and debugging in voice-based identification systems.

Core claim

The central discovery is the proposal of the Modified RISE-eval algorithm, which addresses shortcomings in prior attention map evaluation techniques. Application of this algorithm to attention maps generated by GradCAM and LayerCAM on speaker recognition networks demonstrates that each method exhibits distinct advantages under different experimental conditions.

What carries the argument

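The argument rests on the Modified RISE-eval algorithm, which scores an attention map by how well it identifies the input regions that drive the network's speaker identity decisions, while correcting the shortcomings the paper identifies in the original RISE-eval (Randomised Input Sampling for Explanation - Evaluation).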

If this is right

  • Attention map evaluation can guide the choice of visualization method based on specific task conditions in speaker recognition.
  • GradCAM may excel in certain audio input scenarios, while LayerCAM may excel in others.
  • Systematic evaluation enables more reliable interpretation of neural network decisions for speaker identification.
  • The modified algorithm provides a basis for comparing other CAM-based methods in audio tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to other audio processing tasks where understanding model focus is important, such as speech emotion recognition.
  • If attention maps prove reliable, they might inform model improvements by highlighting unnecessary input dependencies.
  • The findings suggest potential for developing hybrid visualization techniques that combine strengths of both methods.

Load-bearing premise

Neural networks have attention mechanisms analogous to human attention that can be meaningfully captured and evaluated by class activation mapping techniques and the modified RISE-eval algorithm.

What would settle it

An experiment showing that masking the regions highlighted by these attention maps does not affect the speaker recognition accuracy in a manner consistent with the Modified RISE-eval scores.
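
One hedged way to picture such an experiment: mask the cells an attention map ranks highest and measure how much the model's score for the true speaker falls. The sketch below is a toy, assuming a hypothetical `score_fn` that maps a spectrogram-like grid to a scalar speaker score. The names `mask_top_fraction`, `score_drop`, and `toy_score` are illustrative and are not part of the paper or of Modified RISE-eval.

```python
def mask_top_fraction(spec, attention, fraction, fill=0.0):
    """Return a copy of `spec` with the highest-attention cells set to `fill`."""
    cells = [(attention[f][t], f, t)
             for f in range(len(spec)) for t in range(len(spec[0]))]
    cells.sort(reverse=True)                      # highest attention first
    n_mask = int(round(fraction * len(cells)))    # number of cells to hide
    masked = [row[:] for row in spec]             # copy, leave the input intact
    for _, f, t in cells[:n_mask]:
        masked[f][t] = fill
    return masked


def score_drop(score_fn, spec, attention, fraction):
    """Score on the clean input minus score on the masked input."""
    return score_fn(spec) - score_fn(mask_top_fraction(spec, attention, fraction))


def toy_score(spec):
    """Hypothetical stand-in for a speaker model: it only 'listens' to row 0."""
    return sum(spec[0])


spec = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
good_map = [[0.9, 0.8, 1.0], [0.0, 0.1, 0.2]]  # highlights the row the scorer uses
bad_map = [[0.0, 0.1, 0.2], [0.9, 0.8, 1.0]]   # highlights the row it ignores

drop_good = score_drop(toy_score, spec, good_map, 0.5)
drop_bad = score_drop(toy_score, spec, bad_map, 0.5)
print(drop_good, drop_bad)
```

With this toy scorer, the map that points at the row the scorer uses produces a large drop, and the map that points elsewhere produces none. A disagreement between drops of this kind and the Modified RISE-eval ranking of the same maps is the sort of evidence this section describes.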

Figures

Figures reproduced from arXiv: 2606.22901 by Mark D. Plumbley, Wenwu Wang, Yanze Xu.

Figure 1: An overview of the Class Activation Map (CAM)
Figure 2: Overview of the Randomised Input Sampling for Explanation - Evaluation (RISE-eval) algorithm [24], which comprises …
Figure 3: A demonstration of RISE-eval’s intermediate evaluation results obtained from Li et al.’s paper [39].
Figure 4: A demonstration for RISE-eval’s overmasking of an …
Figure 5: An overview of experimental procedures
… decisions. Finally, the rescaled performance changes at each step are summed to obtain the final evaluation score for the attention map Ae. Notably, a small constant ϵ should be added to the denominator in implementation, as users may manually set the sampling ratio r_samp to zero at early steps, resulting in Rmask[0] = 0 and leading to division by zero if left unaddr…
Figure 6: Visualisation of attention maps when our speaker recognition network (i.e. a ResNet34 CNN model) classifies a …
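
The text captured with the Figure 5 caption says that rescaled per-step performance changes are summed into a final score, and that a small constant ϵ belongs in the denominator because the masked ratio at the first step can be zero. A minimal sketch of that guard follows. It assumes the rescaling divides each step's performance change by that step's masked ratio; the excerpt does not show the actual formula, so this form is a guess. The names `rescaled_score`, `perf_changes`, and `r_mask` are hypothetical.

```python
def rescaled_score(perf_changes, r_mask, eps=1e-8):
    """Sum per-step performance changes, each divided by (masked ratio + eps).

    Assumed form only: the excerpt states that eps guards the denominator,
    not what the full rescaling looks like.
    """
    if len(perf_changes) != len(r_mask):
        raise ValueError("one masked ratio is needed per step")
    return sum(d / (r + eps) for d, r in zip(perf_changes, r_mask))


# Step 0 masks nothing (ratio 0), which would divide by zero without eps.
perf_changes = [0.0, 0.2, 0.3]
r_mask = [0.0, 0.1, 0.2]

score = rescaled_score(perf_changes, r_mask)
print(score)
```

Because ϵ is tiny compared with any non-zero masked ratio, it leaves the later steps essentially unchanged and only matters where the ratio is exactly zero.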
Original abstract

Explaining and understanding the decision-making process of artificial intelligence (AI) systems, particularly those implemented by neural networks, falls within the field of explainable AI (XAI). Analogous to the human attention mechanism, neural networks are assumed to possess their own attention mechanisms that selectively process information during decision-making. This work proposes to study one XAI topic: analysing and visualising the attention mechanisms of neural networks. Our experiments are performed on speaker recognition neural networks that are trained to identify speaker identity from a given utterance. Previous studies have widely used class activation map (CAM)-based methods to analyse and visualise the attention mechanisms of neural networks. Each of these methods produces an attention map for each network input, highlighting which input regions are selectively processed when the speaker recognition network makes decisions. However, the evaluation of attention maps produced by these methods remains largely underexplored. This work systematically reviews an existing attention map evaluation algorithm, establishing key concepts and identifying its shortcomings. On the basis of this existing evaluation algorithm, a new version is then proposed to address the identified shortcomings, called the Modified Randomised Input Sampling for Explanation - Evaluation algorithm (Modified RISE-eval). Using Modified RISE-eval, we evaluate the attention maps produced by two representative CAM-based methods, GradCAM and LayerCAM, applied to a certain speaker recognition network. The evaluation results demonstrate that GradCAM and LayerCAM each exhibit distinct advantages when applied under different experimental conditions in the speaker recognition task.
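
As a rough illustration of how the two CAM variants named in the abstract differ, the sketch below follows the published GradCAM [40] and LayerCAM [15] definitions on plain nested lists. GradCAM weights each feature-map channel by the spatial average of its gradient, while LayerCAM weights each position by the positive part of its own gradient. The function names, the `[channel][frequency][time]` layout, and the toy numbers are assumptions made for illustration, not the authors' implementation.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def grad_cam(acts, grads):
    """GradCAM: one weight per channel, the mean gradient over all positions."""
    n_ch, n_f, n_t = len(acts), len(acts[0]), len(acts[0][0])
    weights = [sum(map(sum, grads[k])) / (n_f * n_t) for k in range(n_ch)]
    return [[relu(sum(weights[k] * acts[k][f][t] for k in range(n_ch)))
             for t in range(n_t)] for f in range(n_f)]


def layer_cam(acts, grads):
    """LayerCAM: one weight per position, the positive part of the local gradient."""
    n_ch, n_f, n_t = len(acts), len(acts[0]), len(acts[0][0])
    return [[relu(sum(relu(grads[k][f][t]) * acts[k][f][t] for k in range(n_ch)))
             for t in range(n_t)] for f in range(n_f)]


# Toy feature maps: 2 channels on a 2 x 2 (frequency x time) grid.
acts = [[[1.0, 2.0], [0.0, 1.0]],
        [[0.5, 0.0], [2.0, 1.0]]]
# Toy gradients of the target speaker score with respect to those maps.
grads = [[[2.0, 0.0], [0.0, 0.0]],
         [[0.0, 0.0], [4.0, -4.0]]]

cam_g = grad_cam(acts, grads)
cam_l = layer_cam(acts, grads)
print(cam_g)
print(cam_l)
```

In this toy case the second channel's gradients average to zero, so GradCAM ignores that channel entirely, whereas LayerCAM still picks up its one strongly positive position. This is one simple way the two methods can highlight different regions of the same input.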

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript reviews shortcomings of the existing RISE-eval algorithm for assessing attention maps, proposes Modified RISE-eval to address them, and applies the modified evaluator to compare attention maps generated by GradCAM and LayerCAM on a speaker recognition network. The central empirical claim is that the two CAM methods exhibit distinct advantages under different experimental conditions.

Significance. If the Modified RISE-eval is shown to be a valid improvement and the reported advantages are reproducible, the work supplies a concrete evaluation protocol for XAI methods in speaker recognition. This is useful for audio biometrics applications where interpretability matters, and the explicit grounding in an existing published evaluator (with stated modifications) is a strength.

minor comments (3)
  1. [Abstract] The network is referred to only as 'a certain speaker recognition network.' The full manuscript should name the architecture, training corpus, and input representation (e.g., spectrogram type) so that the evaluation conditions can be replicated.
  2. The description of the modifications that turn RISE-eval into Modified RISE-eval should be accompanied by an explicit side-by-side comparison (perhaps a table) listing each identified shortcoming and the precise change made to remedy it.
  3. Results section: the claim of 'distinct advantages under different experimental conditions' needs to be supported by the actual quantitative scores (e.g., the modified RISE-eval metric values) rather than a qualitative summary; include the relevant table or figure reference.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their constructive review and positive recommendation for minor revision. The assessment correctly identifies the core contributions of the manuscript in reviewing RISE-eval shortcomings, proposing Modified RISE-eval, and demonstrating distinct advantages of GradCAM versus LayerCAM under different conditions in speaker recognition.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper reviews shortcomings of an existing published attention-map evaluation algorithm (RISE-eval), proposes explicit modifications to create Modified RISE-eval, and then applies the modified evaluator to compare GradCAM and LayerCAM outputs on one speaker-recognition network. No equations, fitted parameters, or self-citation chains reduce the reported empirical comparison to a quantity defined by the authors' own inputs. The core premise that networks possess attention-like mechanisms is explicitly labeled an assumption rather than a derived result. The evaluation rests on external benchmarks and the stated modifications, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions in neural network interpretability rather than new postulates. No free parameters, axioms beyond domain standards, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Neural networks possess attention mechanisms that can be visualized via class activation map methods
    Stated in the opening of the abstract as the foundational premise for the entire study.

pith-pipeline@v0.9.1-grok · 5806 in / 1359 out tokens · 19310 ms · 2026-06-26T07:32:36.736833+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 1 canonical work pages

  1. [1]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  2. [2]

    Deep learning,

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015

  3. [3]

    Darpa’s explainable artificial intelligence (xai) program,

    D. Gunning and D. Aha, “Darpa’s explainable artificial intelligence (xai) program,” AI magazine, vol. 40, no. 2, pp. 44–58, 2019

  4. [4]

    Explainable ai: A brief survey on history, research areas, approaches and challenges,

    F. Xu, H. Uszkoreit, Y. Du, W. Fan, D. Zhao, and J. Zhu, “Explainable ai: A brief survey on history, research areas, approaches and challenges,” in Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II 8. Springer, 2019, pp. 563–574

  5. [5]

    Explainable ai: A review of machine learning interpretability methods,

    P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, “Explainable ai: A review of machine learning interpretability methods,” Entropy, vol. 23, no. 1, p. 18, 2020

  6. [6]

    An overview of the supervised machine learning methods,

    V. Nasteski, “An overview of the supervised machine learning methods,” Horizons. b, vol. 4, no. 51-62, p. 56, 2017

  7. [7]

    How humans learn and represent networks,

    C. W. Lynn and D. S. Bassett, “How humans learn and represent networks,” Proceedings of the National Academy of Sciences, vol. 117, no. 47, pp. 29407–29415, 2020

  8. [8]

    Selective attention

    W. A. Johnston and V. J. Dark, “Selective attention.” Annual review of psychology, 1986

  9. [9]

    Explainable ai in speaker recognition–making latent representations understandable,

    Y. Xu, W. Wang, and M. D. Plumbley, “Explainable ai in speaker recognition–making latent representations understandable,” arXiv preprint arXiv:2604.23354, 2026

  10. [10]

    Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,

    W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” arXiv preprint arXiv:1804.05160, 2018

  11. [11]

    Voxceleb: a large-scale speaker identification dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017

  12. [12]

    In defence of metric learning for speaker recognition,

    J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” arXiv preprint arXiv:2003.11982, 2020

  13. [13]

    Learning deep features for discriminative localization,

    B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929

  14. [14]

    Master-cam: Multi-scale fusion guided by master map for high-quality class activation maps,

    X. Zhou, Y. Li, G. Cao, and W. Cao, “Master-cam: Multi-scale fusion guided by master map for high-quality class activation maps,” Displays, vol. 76, p. 102339, 2023

  15. [15]

    Layercam: Exploring hierarchical class activation maps for localization,

    P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y. Wei, “Layercam: Exploring hierarchical class activation maps for localization,” IEEE Transactions on Image Processing, vol. 30, pp. 5875–5888, 2021

  16. [16]

    Score-cam: Score-weighted visual explanations for convolutional neural networks,

    H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 24–25

  17. [17]

    Grad-cam: Why did you say that?

    R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, “Grad-cam: Why did you say that?” arXiv preprint arXiv:1611.07450, 2016

  18. [18]

    Cameras: Enhanced resolution and sanity preserving class activation mapping for image saliency,

    M. A. Jalwana, N. Akhtar, M. Bennamoun, and A. Mian, “Cameras: Enhanced resolution and sanity preserving class activation mapping for image saliency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16327–16336

  19. [19]

    A unified approach to interpreting model predictions,

    S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017

  20. [20]

    “Why should I trust you?” Explaining the predictions of any classifier,

    M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’ Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144

  21. [21]

    Can we trust explainable ai methods on asr? an evaluation on phoneme recognition,

    X. Wu, P. Bell, and A. Rajan, “Can we trust explainable ai methods on asr? an evaluation on phoneme recognition,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10296–10300

  22. [22]

    Neural network interpretability with layer-wise relevance propagation: Novel techniques for neuron selection and visualization,

    D. Bhati, F. Neha, M. Amiruzzaman, A. Guercio, D. K. Shukla, and B. Ward, “Neural network interpretability with layer-wise relevance propagation: Novel techniques for neuron selection and visualization,” in 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), 2025, pp. 00441–00447

  23. [23]

    Explainable ai without interpretable model,

    K. Främling, “Explainable ai without interpretable model,” arXiv preprint arXiv:2009.13996, 2020

  24. [24]

    Rise: Randomized input sampling for explanation of black-box models,

    V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,” arXiv preprint arXiv:1806.07421, 2018

  25. [25]

    Slrp: Improved heatmap generation via selective layer-wise relevance propagation,

    Y.-J. Jung, S.-H. Han, and H.-J. Choi, “Slrp: Improved heatmap generation via selective layer-wise relevance propagation,” Electronics Letters, vol. 57, no. 10, pp. 393–396, 2021

  26. [26]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

  27. [27]

    Layer-Wise Relevance Propagation: An Overview,

    G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, Layer-Wise Relevance Propagation: An Overview. Cham: Springer International Publishing, 2019, pp. 193–209. [Online]. Available: https://doi.org/10.1007/978-3-030-28954-6_10

  28. [28]

    Voxceleb2: Deep speaker recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018

  29. [29]

    Vision transformer with attention map hallucination and ffn compaction,

    H. Xu, Z. Zhou, D. He, F. Li, and J. Wang, “Vision transformer with attention map hallucination and ffn compaction,” arXiv preprint arXiv:2306.10875, 2023

  30. [30]

    Attentive pooling networks,

    C. d. Santos, M. Tan, B. Xiang, and B. Zhou, “Attentive pooling networks,” arXiv preprint arXiv:1602.03609, 2016

  31. [31]

    Vision transformer with attentive pooling for robust facial expression recognition,

    F. Xue, Q. Wang, Z. Tan, Z. Ma, and G. Guo, “Vision transformer with attentive pooling for robust facial expression recognition,” IEEE Transactions on Affective Computing, vol. 14, no. 4, pp. 3244–3256, 2022

  32. [32]

    Opening the black box of deep neural networks via information,

    R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017

  33. [33]

    Contextual Importance and Utility: A Theoretical Foundation,

    K. Främling, “Contextual Importance and Utility: A Theoretical Foundation,” in AI 2021: Advances in Artificial Intelligence, G. Long, X. Yu, and S. Wang, Eds. Cham: Springer International Publishing, 2022, pp. 117–128

  34. [34]

    Hopfield networks is all you need,

    H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve et al., “Hopfield networks is all you need,” arXiv preprint arXiv:2008.02217, 2020.

  35. [35]

    How to explain individual classification decisions,

    D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” J. Mach. Learn. Res., vol. 11, pp. 1803–1831, Aug. 2010

  36. [36]

    Visualizing and understanding convolutional networks,

    M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818–833

  37. [37]

    Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,

    A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 2018, pp. 839–847

  38. [38]

    A model of saliency-based visual attention for rapid scene analysis,

    L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998

  39. [39]

    Reliable visualization for deep speaker recognition,

    P. Li, L. Li, A. Hamdulla, and D. Wang, “Reliable visualization for deep speaker recognition,” arXiv preprint arXiv:2204.03852, 2022

  40. [40]

    Grad-cam: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626

  41. [41]

    Markedly enhanced analysis of mass spectrometry images using weakly supervised machine learning,

    W. Gardner, D. A. Winkler, S. E. Bamford, B. W. Muir, and P. J. Pigram, “Markedly enhanced analysis of mass spectrometry images using weakly supervised machine learning,” Small Methods, vol. 8, no. 7, p. 2301230, 2024

  42. [42]

    Visual explanation and robustness assessment optimization of saliency maps for image classification,

    X. Xu and J. Mo, “Visual explanation and robustness assessment optimization of saliency maps for image classification,” The Visual Computer, vol. 39, no. 12, pp. 6097–6113, 2023

  43. [43]

    Prototypical networks for few-shot learning,

    J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017