Explainable AI in Speaker Recognition -- Attention Map Visualisation and Evaluation

Mark D. Plumbley; Wenwu Wang; Yanze Xu

arXiv: 2606.22901 · v1 · pith:DBBBDLRBnew · submitted 2026-06-22 · eess.AS · cs.AI · eess.SP

Pith reviewed 2026-06-26 07:32 UTC · model grok-4.3

classification: eess.AS · cs.AI · eess.SP

keywords: explainable AI · speaker recognition · attention maps · GradCAM · LayerCAM · RISE-eval · neural network visualization

The pith

Modified RISE-eval shows GradCAM and LayerCAM each have distinct advantages for visualizing attention in speaker recognition networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the challenge of understanding how neural networks make decisions in speaker recognition by focusing on attention map visualization. It reviews an existing evaluation method for these maps, identifies its limitations, and proposes a modified version called Modified RISE-eval to fix them. The new method is then used to compare two common visualization approaches, GradCAM and LayerCAM, on a speaker recognition model. Results indicate that the two methods perform differently depending on the experimental setup, suggesting neither is universally superior. A reader might care because reliable explanations of AI decisions could improve trust and debugging in voice-based identification systems.

Core claim

The central contribution is the Modified RISE-eval algorithm, which addresses shortcomings in a prior attention map evaluation technique. Applying this algorithm to attention maps generated by GradCAM and LayerCAM on a speaker recognition network shows that each method exhibits distinct advantages under different experimental conditions.
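To make the two compared methods concrete, here is a minimal NumPy sketch of how GradCAM and LayerCAM are commonly formulated: GradCAM weights each feature channel by its spatially averaged gradient, while LayerCAM weights each location by its own positive gradient. The array shapes, function names, and toy inputs are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def gradcam_map(activations, gradients):
    """GradCAM-style map from arrays of shape (K, F, T).

    K = feature channels, F x T = frequency/time grid (assumed layout).
    """
    # One weight per channel: the gradient averaged over the F x T grid.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of channel activations, then ReLU.
    cam = np.tensordot(weights, activations, axes=1)
    return np.maximum(cam, 0.0)

def layercam_map(activations, gradients):
    """LayerCAM-style map: element-wise positive gradients as weights."""
    cam = (np.maximum(gradients, 0.0) * activations).sum(axis=0)
    return np.maximum(cam, 0.0)

# Toy example: one channel, constant activations, mixed-sign gradients.
acts = np.ones((1, 2, 2))
grads = np.array([[[1.0, -1.0], [1.0, 1.0]]])
g_map = gradcam_map(acts, grads)    # uniform map: channel weight is 0.5
l_map = layercam_map(acts, grads)   # zero where the local gradient is negative
```

In the toy example GradCAM spreads a single channel weight over every location, whereas LayerCAM suppresses the one location with a negative gradient, which is one reason the two methods can rank input regions differently.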

What carries the argument

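The argument rests on the Modified RISE-eval algorithm, which scores attention maps while addressing limitations of the original RISE-eval (Randomised Input Sampling for Explanation - Evaluation), so that the scores better reflect how relevant the highlighted regions are to speaker identity decisions.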

If this is right

Attention map evaluation can guide the choice of visualization method based on specific task conditions in speaker recognition.
GradCAM may excel in some audio input scenarios, while LayerCAM may excel in others.
Systematic evaluation enables more reliable interpretation of neural network decisions for speaker identification.
The modified algorithm provides a basis for comparing other CAM-based methods in audio tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to other audio processing tasks where understanding model focus is important, such as speech emotion recognition.
If attention maps prove reliable, they might inform model improvements by highlighting unnecessary input dependencies.
The findings suggest potential for developing hybrid visualization techniques that combine strengths of both methods.

Load-bearing premise

Neural networks have attention mechanisms analogous to human attention that can be meaningfully captured and evaluated by class activation mapping techniques and the modified RISE-eval algorithm.

What would settle it

An experiment showing that masking the regions highlighted by these attention maps does not affect the speaker recognition accuracy in a manner consistent with the Modified RISE-eval scores.
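A masking check of this kind could be sketched as follows. This is a generic deletion-style test under stated assumptions, not the Modified RISE-eval procedure itself: `score_fn` is a hypothetical stand-in for the network's score for the target speaker, and the spectrogram, maps, and masking fraction are toy values.

```python
import numpy as np

def mask_top_fraction(spec, attention, fraction, fill=0.0):
    """Copy spec and overwrite the highest-attention bins with fill."""
    n_mask = int(round(fraction * attention.size))
    masked = spec.copy()
    if n_mask == 0:
        return masked
    # Flat indices of the n_mask largest attention values.
    idx = np.argsort(attention, axis=None)[::-1][:n_mask]
    masked.flat[idx] = fill
    return masked

def score_drop(score_fn, spec, attention, fraction):
    """Drop in model score after masking the top-attention bins."""
    return score_fn(spec) - score_fn(mask_top_fraction(spec, attention, fraction))

# Hypothetical model: its score depends only on the first two rows.
def toy_score(spec):
    return float(spec[:2].sum())

spec = np.ones((4, 5))
good_map = np.zeros((4, 5))
good_map[:2] = 1.0   # highlights the rows the toy model uses
bad_map = np.zeros((4, 5))
bad_map[2:] = 1.0    # highlights rows the toy model ignores

drop_good = score_drop(toy_score, spec, good_map, 0.5)
drop_bad = score_drop(toy_score, spec, bad_map, 0.5)
```

If an evaluator ranked `bad_map` above `good_map` even though masking its highlighted bins leaves the score unchanged, that would be the kind of inconsistency described above.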

Figures

Figures reproduced from arXiv: 2606.22901 by Mark D. Plumbley, Wenwu Wang, Yanze Xu.

**Figure 2.** Overview of the Randomised Input Sampling for Explanation - Evaluation (RISE-eval) algorithm [24], which comprises …

**Figure 3.** A demonstration of RISE-eval’s intermediate evaluation results obtained from Li et al.’s paper [39].

**Figure 4.** A demonstration for RISE-eval’s overmasking of an …

**Figure 5.** An overview of experimental procedures.

… decisions. Finally, the rescaled performance changes at each step are summed to obtain the final evaluation score for the attention map A_e. Notably, a small constant ϵ should be added to the denominator in implementation, as users may manually set the sampling ratio r_samp to zero at early steps, resulting in R_mask[0] = 0 and leading to division by zero if left unaddr…

**Figure 6.** Visualisation of attention maps when our speaker recognition network (i.e. a ResNet34 CNN model) classifies a …
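The text captured alongside the Figure 5 caption mentions summing rescaled performance changes and guarding the denominator with a small constant ϵ. The sketch below illustrates only that guard. The exact rescaling used in the paper is not reproduced on this page, so dividing each step's performance change by its masking ratio is an assumed form, and the function name and sample values are hypothetical.

```python
import numpy as np

def summed_rescaled_changes(perf_changes, mask_ratios, eps=1e-8):
    """Sum per-step performance changes rescaled by the masking ratio.

    Assumed form: sum_t change[t] / (ratio[t] + eps). The eps term keeps
    the division finite when a step has a masking ratio of exactly zero.
    """
    perf_changes = np.asarray(perf_changes, dtype=float)
    mask_ratios = np.asarray(mask_ratios, dtype=float)
    return float(np.sum(perf_changes / (mask_ratios + eps)))

# First step masks nothing (ratio 0), so its change is also 0.
score = summed_rescaled_changes([0.0, 0.2, 0.3], [0.0, 0.1, 0.3])
```

Without `eps`, the first step would evaluate 0/0 and propagate NaN into the sum; with it, that step contributes zero and the score stays finite.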

Original abstract

Explaining and understanding the decision-making process of artificial intelligence (AI) systems, particularly those implemented by neural networks, falls within the field of explainable AI (XAI). Analogous to the human attention mechanism, neural networks are assumed to possess their own attention mechanisms that selectively process information during decision-making. This work proposes to study one XAI topic: analysing and visualising the attention mechanisms of neural networks. Our experiments are performed on speaker recognition neural networks that are trained to identify speaker identity from a given utterance. Previous studies have widely used class activation map (CAM)-based methods to analyse and visualise the attention mechanisms of neural networks. Each of these methods produces an attention map for each network input, highlighting which input regions are selectively processed when the speaker recognition network makes decisions. However, the evaluation of attention maps produced by these methods remains largely underexplored. This work systematically reviews an existing attention map evaluation algorithm, establishing key concepts and identifying its shortcomings. On the basis of this existing evaluation algorithm, a new version is then proposed to address the identified shortcomings, called the Modified Randomised Input Sampling for Explanation - Evaluation algorithm (Modified RISE-eval). Using Modified RISE-eval, we evaluate the attention maps produced by two representative CAM-based methods, GradCAM and LayerCAM, applied to a certain speaker recognition network. The evaluation results demonstrate that GradCAM and LayerCAM each exhibit distinct advantages when applied under different experimental conditions in the speaker recognition task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tweaks RISE-eval to fix a few evaluation issues and shows GradCAM and LayerCAM trade off on speaker recognition, but stays narrow and assumption-heavy.

The main thing to know is that the authors review RISE-eval, flag some shortcomings, propose a modified version, and then use it to compare GradCAM and LayerCAM on one speaker recognition network. The abstract reports that each method shows distinct advantages under different conditions.

The review of the prior algorithm and the explicit modifications are the clearest new piece. They treat the attention analogy as an assumption rather than a result, which keeps the framing honest. Applying the evaluator to audio models is a reasonable next step for that line of work.

The soft spots are straightforward. Everything stays inside the CAM family and the test is limited to a single network, so we do not learn how far the modification travels or whether it changes conclusions on other tasks. The abstract gives no quantitative details on how large the improvement is or exactly which shortcomings were fixed, which makes it hard to judge the practical gain.

This paper is for the small set of people already working on XAI tools for speaker recognition or similar audio tasks. Readers who know the original RISE-eval will see the incremental value quickly. It shows clear engagement with the existing method and no internal contradictions in the stated claim.

I would send it for peer review. The modification is the sort of targeted fix that benefits from referee input on whether the changes are well justified and reproducible.

Referee Report

0 major / 3 minor

Summary. The manuscript reviews shortcomings of the existing RISE-eval algorithm for assessing attention maps, proposes Modified RISE-eval to address them, and applies the modified evaluator to compare attention maps generated by GradCAM and LayerCAM on a speaker recognition network. The central empirical claim is that the two CAM methods exhibit distinct advantages under different experimental conditions.

Significance. If the Modified RISE-eval is shown to be a valid improvement and the reported advantages are reproducible, the work supplies a concrete evaluation protocol for XAI methods in speaker recognition. This is useful for audio biometrics applications where interpretability matters, and the explicit grounding in an existing published evaluator (with stated modifications) is a strength.

minor comments (3)

[Abstract] The network is referred to only as 'a certain speaker recognition network.' The full manuscript should name the architecture, training corpus, and input representation (e.g., spectrogram type) so that the evaluation conditions can be replicated.
The description of the modifications that turn RISE-eval into Modified RISE-eval should be accompanied by an explicit side-by-side comparison (perhaps a table) listing each identified shortcoming and the precise change made to remedy it.
Results section: the claim of 'distinct advantages under different experimental conditions' needs to be supported by the actual quantitative scores (e.g., the modified RISE-eval metric values) rather than a qualitative summary; include the relevant table or figure reference.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their constructive review and positive recommendation for minor revision. The assessment correctly identifies the core contributions of the manuscript in reviewing RISE-eval shortcomings, proposing Modified RISE-eval, and demonstrating distinct advantages of GradCAM versus LayerCAM under different conditions in speaker recognition.

Circularity Check

0 steps flagged

No significant circularity identified

The paper reviews shortcomings of an existing published attention-map evaluation algorithm (RISE-eval), proposes explicit modifications to create Modified RISE-eval, and then applies the modified evaluator to compare GradCAM and LayerCAM outputs on one speaker-recognition network. No equations, fitted parameters, or self-citation chains reduce the reported empirical comparison to a quantity defined by the authors' own inputs. The core premise that networks possess attention-like mechanisms is explicitly labeled an assumption rather than a derived result. The evaluation rests on external benchmarks and the stated modifications, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions in neural network interpretability rather than new postulates. No free parameters, axioms beyond domain standards, or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Neural networks possess attention mechanisms that can be visualized via class activation map methods.
Stated in the opening of the abstract as the foundational premise for the entire study.

pith-pipeline@v0.9.1-grok · 5806 in / 1359 out tokens · 19310 ms · 2026-06-26T07:32:36.736833+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 1 canonical work pages

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[3] D. Gunning and D. Aha, “Darpa’s explainable artificial intelligence (xai) program,” AI magazine, vol. 40, no. 2, pp. 44–58, 2019.
[4] F. Xu, H. Uszkoreit, Y. Du, W. Fan, D. Zhao, and J. Zhu, “Explainable ai: A brief survey on history, research areas, approaches and challenges,” in Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II 8. Springer, 2019, pp. 563–574.
[5] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, “Explainable ai: A review of machine learning interpretability methods,” Entropy, vol. 23, no. 1, p. 18, 2020.
[6] V. Nasteski, “An overview of the supervised machine learning methods,” Horizons. b, vol. 4, no. 51-62, p. 56, 2017.
[7] C. W. Lynn and D. S. Bassett, “How humans learn and represent networks,” Proceedings of the National Academy of Sciences, vol. 117, no. 47, pp. 29407–29415, 2020.
[8] W. A. Johnston and V. J. Dark, “Selective attention,” Annual review of psychology, 1986.
[9] Y. Xu, W. Wang, and M. D. Plumbley, “Explainable ai in speaker recognition–making latent representations understandable,” arXiv preprint arXiv:2604.23354, 2026.
[10] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” arXiv preprint arXiv:1804.05160, 2018.
[11] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[12] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” arXiv preprint arXiv:2003.11982, 2020.
[13] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
[14] X. Zhou, Y. Li, G. Cao, and W. Cao, “Master-cam: Multi-scale fusion guided by master map for high-quality class activation maps,” Displays, vol. 76, p. 102339, 2023.
[15] P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y. Wei, “Layercam: Exploring hierarchical class activation maps for localization,” IEEE Transactions on Image Processing, vol. 30, pp. 5875–5888, 2021.
[16] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 24–25.
[17] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, “Grad-cam: Why did you say that?” arXiv preprint arXiv:1611.07450, 2016.
[18] M. A. Jalwana, N. Akhtar, M. Bennamoun, and A. Mian, “Cameras: Enhanced resolution and sanity preserving class activation mapping for image saliency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16327–16336.
[19] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
[20] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust you?’ explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144.
[21] X. Wu, P. Bell, and A. Rajan, “Can we trust explainable ai methods on asr? an evaluation on phoneme recognition,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10296–10300.
[22] D. Bhati, F. Neha, M. Amiruzzaman, A. Guercio, D. K. Shukla, and B. Ward, “Neural network interpretability with layer-wise relevance propagation: Novel techniques for neuron selection and visualization,” in 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), 2025, pp. 00441–00447.
[23] K. Främling, “Explainable ai without interpretable model,” arXiv preprint arXiv:2009.13996, 2020.
[24] V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,” arXiv preprint arXiv:1806.07421, 2018.
[25] Y.-J. Jung, S.-H. Han, and H.-J. Choi, “Slrp: Improved heatmap generation via selective layer-wise relevance propagation,” Electronics Letters, vol. 57, no. 10, pp. 393–396, 2021.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[27] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, Layer-Wise Relevance Propagation: An Overview. Cham: Springer International Publishing, 2019, pp. 193–209. [Online]. Available: https://doi.org/10.1007/978-3-030-28954-6_10
[28] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
[29] H. Xu, Z. Zhou, D. He, F. Li, and J. Wang, “Vision transformer with attention map hallucination and ffn compaction,” arXiv preprint arXiv:2306.10875, 2023.
[30] C. d. Santos, M. Tan, B. Xiang, and B. Zhou, “Attentive pooling networks,” arXiv preprint arXiv:1602.03609, 2016.
[31] F. Xue, Q. Wang, Z. Tan, Z. Ma, and G. Guo, “Vision transformer with attentive pooling for robust facial expression recognition,” IEEE Transactions on Affective Computing, vol. 14, no. 4, pp. 3244–3256, 2022.
[32] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
[33] K. Främling, “Contextual Importance and Utility: A Theoretical Foundation,” in AI 2021: Advances in Artificial Intelligence, G. Long, X. Yu, and S. Wang, Eds. Cham: Springer International Publishing, 2022, pp. 117–128.
[34] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve et al., “Hopfield networks is all you need,” arXiv preprint arXiv:2008.02217, 2020.
[35] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” J. Mach. Learn. Res., vol. 11, pp. 1803–1831, Aug. 2010.
[36] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818–833.
[37] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 2018, pp. 839–847.
[38] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[39] P. Li, L. Li, A. Hamdulla, and D. Wang, “Reliable visualization for deep speaker recognition,” arXiv preprint arXiv:2204.03852, 2022.
[40] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
[41] W. Gardner, D. A. Winkler, S. E. Bamford, B. W. Muir, and P. J. Pigram, “Markedly enhanced analysis of mass spectrometry images using weakly supervised machine learning,” Small Methods, vol. 8, no. 7, p. 2301230, 2024.
[42] X. Xu and J. Mo, “Visual explanation and robustness assessment optimization of saliency maps for image classification,” The Visual Computer, vol. 39, no. 12, pp. 6097–6113, 2023.
[43] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
