pith. machine review for the scientific record.

arxiv: 2604.01590 · v2 · submitted 2026-04-02 · 📡 eess.AS · cs.SD

Recognition: 1 theorem link · Lean Theorem

PhiNet: Speaker Verification with Phonetic Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:09 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords speaker verification · phonetic interpretability · forensic speaker comparison · automatic speaker verification · model interpretability · PhiNet · VoxCeleb

The pith

PhiNet adds phonetic-level explanations to speaker verification while matching the accuracy of black-box models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhiNet as a speaker verification network that incorporates phonetic evidence to generate interpretable decisions. This design lets users examine speaker-specific phonetic features for manual review and gives developers explicit reasoning for each outcome. The approach draws from human forensic speaker comparison practices to improve transparency in high-stakes applications. Experiments on VoxCeleb, SITW, and LibriSpeech show that the added interpretability does not degrade verification performance relative to standard models.

Core claim

PhiNet enhances local and global interpretability by leveraging phonetic evidence in decision-making, supplying detailed phonetic-level comparisons that support manual inspection of speaker-specific features and explicit reasoning for verification outcomes, while delivering performance comparable to traditional black-box ASV models across VoxCeleb, SITW, and LibriSpeech.

What carries the argument

PhiNet, a speaker verification network that integrates phonetic evidence into its decision process to produce both local phonetic comparisons and global reasoning traces.
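
The review does not reproduce PhiNet's scoring machinery, but the general shape of a phoneme-by-phoneme comparison that doubles as a local explanation can be sketched as below. The phoneme inventory, trait dimensionality, cosine similarity, and uniform weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of phoneme-level comparison for speaker verification.
# Assumption: each utterance has already been decomposed into per-phoneme
# "trait" vectors (e.g. by a phonetic trait extractor); names and shapes
# here are illustrative, not the paper's actual design.
import numpy as np

PHONEMES = ["AA", "IY", "S", "N", "T"]   # toy inventory
TRAIT_DIM = 16

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def phonetic_verification(enroll, test, weights):
    """Score a trial and return a per-phoneme breakdown (the 'local' explanation).

    enroll/test map phoneme -> trait vector; weights map phoneme -> importance.
    Phonemes missing from either utterance are skipped, one plausible way to
    handle partial phonetic coverage.
    """
    per_phoneme = {}
    num = den = 0.0
    for ph in PHONEMES:
        if ph in enroll and ph in test:
            s = cosine(enroll[ph], test[ph])
            per_phoneme[ph] = s
            w = weights.get(ph, 1.0)
            num += w * s
            den += w
    return num / max(den, 1e-9), per_phoneme

# Toy usage: a "test" utterance built as a noisy copy of the enrollment traits.
rng = np.random.default_rng(0)
enroll = {ph: rng.normal(size=TRAIT_DIM) for ph in PHONEMES}
test = {ph: v + 0.1 * rng.normal(size=TRAIT_DIM) for ph, v in enroll.items()}
score, breakdown = phonetic_verification(enroll, test, {ph: 1.0 for ph in PHONEMES})
print(round(score, 3), {ph: round(s, 2) for ph, s in breakdown.items()})
```

The per-phoneme breakdown is what would be surfaced for manual inspection; a global explanation would instead summarize the learned weights across many trials.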

If this is right

  • Verification decisions become traceable at the phonetic level, allowing direct manual checks of speaker features.
  • Error analysis and hyperparameter tuning become simpler because the network supplies explicit phonetic reasoning.
  • The system supports forensic applications by aligning automatic outputs with human comparison methods.
  • Performance remains competitive with black-box models on standard benchmarks including VoxCeleb and SITW.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the phonetic features prove stable across languages, the approach could support cross-lingual forensic verification.
  • Integrating PhiNet-style explanations into other audio tasks might improve accountability in voice-based security systems.
  • Controlled user studies comparing PhiNet outputs to traditional forensic reports could quantify gains in decision reliability.

Load-bearing premise

Phonetic evidence extracted by the network genuinely captures speaker-specific traits that human experts can use for forensic inspection without introducing new biases or accuracy loss.

What would settle it

A test set where PhiNet's verification error rate rises above black-box baselines or where human forensic analysts rate the phonetic explanations as uninformative or inconsistent with their own judgments.
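
The error-rate side of that test would be read off the standard ASV metric, the equal error rate (EER). A minimal sketch of how EER is computed from a set of trial scores, with synthetic score distributions standing in for real system outputs:

```python
# Equal error rate (EER): the operating point where the false-acceptance and
# false-rejection rates cross, found by sweeping all candidate thresholds.
# The score arrays below are synthetic placeholders, not reported results.
import numpy as np

def eer(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejections
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

rng = np.random.default_rng(1)
target = rng.normal(1.0, 0.5, 2000)      # synthetic same-speaker trial scores
nontarget = rng.normal(0.0, 0.5, 2000)   # synthetic different-speaker trial scores
print(f"EER ≈ {100 * eer(target, nontarget):.2f}%")
```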

Figures

Figures reproduced from arXiv: 2604.01590 by Haizhou Li, Shuai Wang, Tianchi Liu, Yi Ma.

Figure 1. Block diagram of the proposed framework for speaker verification with phonetic interpretability.
Figure 2. Block diagram of the phonetic trait extractor.
Figure 4. Visualization of the decision-making process for a non-target trial (top) and a target trial (bottom). Phonetic boundaries are marked by dotted lines.
Figure 5. Phonetic weight distribution across different phonemes. The results shown are obtained using the model trained under System (10) in Table II.
Figure 6. Phonetic weights for models trained with various durations. The configurations of these models are the same as Systems (4), (6), (10), (12), and (14).
Figure 7. Results of leave-ith-phoneme-out experiments on SITW-eval, Vox1-O, and LibriSpeech. Experiments in which each phoneme is left out of the input spectrogram are shown as “spec-sitw-eval”, “spec-vox1-O”, and “spec-libriSpeech”; correspondingly, “trait” means the phonetic trait of each phoneme is left out, and “baseline” shows the EER of the network without leaving anything out.
Figure 8. Similarity heatmaps between the individual phonetic traits and the …
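
Figure 7's leave-ith-phoneme-out protocol removes one phoneme's contribution at a time and tracks how the EER moves. A toy sketch of that protocol, using synthetic speakers and a stand-in cosine scoring rule rather than the paper's trained network (all names and data below are illustrative):

```python
# Sketch of a leave-one-phoneme-out ablation (cf. Figure 7): re-score every
# trial with one phoneme excluded and compare the resulting EER to a baseline.
import numpy as np

rng = np.random.default_rng(2)
PHONEMES = ["AA", "IY", "S", "N", "T"]   # toy inventory
DIM = 8

def utterance(speaker, noise=0.4):
    """A toy utterance: the speaker's per-phoneme traits plus noise."""
    return {ph: speaker[ph] + noise * rng.normal(size=DIM) for ph in PHONEMES}

def score(enr, tst, exclude=frozenset()):
    """Mean cosine similarity over phonemes, with some phonemes excluded."""
    sims = [float(enr[p] @ tst[p] / (np.linalg.norm(enr[p]) * np.linalg.norm(tst[p])))
            for p in PHONEMES if p not in exclude]
    return float(np.mean(sims))

def eer(tgt, non):
    ts = np.sort(np.concatenate([tgt, non]))
    far = np.array([(non >= t).mean() for t in ts])
    frr = np.array([(tgt < t).mean() for t in ts])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

# Two synthetic speakers; target trials pair utterances of speaker 0,
# non-target trials pair speaker 0 against speaker 1.
spk = [{ph: rng.normal(size=DIM) for ph in PHONEMES} for _ in range(2)]
tgt_trials = [(utterance(spk[0]), utterance(spk[0])) for _ in range(300)]
non_trials = [(utterance(spk[0]), utterance(spk[1])) for _ in range(300)]

def run(exclude=frozenset()):
    return eer(np.array([score(a, b, exclude) for a, b in tgt_trials]),
               np.array([score(a, b, exclude) for a, b in non_trials]))

baseline = run()
for ph in PHONEMES:
    print(ph, f"delta EER = {100 * (run({ph}) - baseline):+.2f} pp")
```

Under this toy setup the deltas are small and roughly uniform; Figure 7 is the real measurement of which phonemes the trained network actually relies on.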
read the original abstract

Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet's interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PhiNet, a neural network for automatic speaker verification (ASV) that integrates phonetic interpretability. Motivated by forensic speaker comparison practices, PhiNet leverages phonetic evidence to generate local explanations (detailed phonetic-level comparisons for manual inspection of speaker-specific features) and global explanations (explicit reasoning for verification decisions and hyperparameter impact). Experiments on VoxCeleb, SITW, and LibriSpeech are reported to demonstrate performance comparable to traditional black-box ASV models, supported by qualitative practical examples and both qualitative and quantitative evaluations of the interpretability methods.

Significance. If the reported results hold, this work could meaningfully advance accountable ASV systems by bridging them with forensic analysis through usable phonetic explanations. The dual emphasis on user-facing manual inspection and developer-facing error tracing is a clear strength, and the multi-benchmark evaluation plus hyperparameter analysis examples add practical value. The significance hinges on whether the phonetic components deliver genuine forensic utility without hidden accuracy costs or new biases.

major comments (2)
  1. §4 (Experiments): The central claim of 'performance comparable to traditional black-box ASV models' requires explicit reporting of metrics such as EER or min t-DCF, baseline comparisons (e.g., x-vector or ECAPA-TDNN), error bars, and ablation results isolating the phonetic module; without these, the empirical parity claim cannot be verified, even though it is load-bearing for the paper's contribution.
  2. §5 (Interpretability Evaluation): The assertion that explanations are 'meaningful' and usable for forensic inspection rests on qualitative examples and architectural choice; quantitative support such as fidelity scores, consistency metrics, or a small user study with forensic experts is needed to substantiate actionability and rule out introduced biases.

minor comments (2)
  1. Abstract: Consider adding one sentence specifying how phonetic information is extracted or injected (e.g., phoneme posterior features or an auxiliary phonetic loss) to improve immediate clarity.
  2. Introduction (notation): Ensure that all acronyms (ASV, FSC) are defined on first use and clarify any new symbols introduced for phonetic embeddings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review and for highlighting areas where the empirical and interpretability claims can be strengthened. We address each major comment below and commit to revisions that will make the supporting evidence explicit and verifiable.

read point-by-point responses
  1. Referee: §4 (Experiments): The central claim of 'performance comparable to traditional black-box ASV models' requires explicit reporting of metrics such as EER or min t-DCF, baseline comparisons (e.g., x-vector or ECAPA-TDNN), error bars, and ablation results isolating the phonetic module; without these, the empirical parity claim cannot be verified, even though it is load-bearing for the paper's contribution.

    Authors: We agree that the current presentation of results is insufficient to substantiate the comparability claim. The revised manuscript will expand §4 with tables reporting EER and min t-DCF on VoxCeleb, SITW, and LibriSpeech, direct comparisons against x-vector and ECAPA-TDNN baselines, standard deviations across runs as error bars, and ablation studies that isolate the phonetic module's contribution. These additions will allow independent verification of performance parity. revision: yes

  2. Referee: §5 (Interpretability Evaluation): The assertion that explanations are 'meaningful' and usable for forensic inspection rests on qualitative examples and architectural choice; quantitative support such as fidelity scores, consistency metrics, or a small user study with forensic experts is needed to substantiate actionability and rule out introduced biases.

    Authors: We acknowledge the need for stronger quantitative grounding. The revised §5 will add fidelity scores measuring alignment between explanations and model decisions, consistency metrics across similar inputs, and explicit discussion of potential biases. We will also clarify the scope of the existing quantitative evaluations already present in the manuscript and, where feasible, include a small-scale expert review; otherwise we will state the current limitations transparently. revision: partial
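
One common way to operationalize the promised fidelity score (hedged: the authors' own metric may differ) is a deletion test: remove the phonemes an explanation ranks highest and check that the verification score drops more than it does under random deletions. A sketch, where `score_fn` is a hypothetical stand-in for a PhiNet-style scorer:

```python
# Deletion-style fidelity check for a phonetic explanation. `score_fn` and
# the explanation format are hypothetical stand-ins, not the paper's metric.
import numpy as np

def deletion_fidelity(score_fn, explanation, trial, k=2, n_random=20, seed=0):
    """score_fn(trial, exclude) -> float; explanation maps phoneme -> weight.

    Returns (drop when the top-k explained phonemes are removed) minus
    (average drop for random k-phoneme removals); a positive value means the
    explanation's top phonemes matter more to the decision than random ones.
    """
    rng = np.random.default_rng(seed)
    phonemes = list(explanation)
    base = score_fn(trial, exclude=set())
    top_k = sorted(phonemes, key=explanation.get, reverse=True)[:k]
    drop_top = base - score_fn(trial, exclude=set(top_k))
    random_drops = [
        base - score_fn(trial, exclude=set(rng.choice(phonemes, size=k, replace=False)))
        for _ in range(n_random)
    ]
    return drop_top - float(np.mean(random_drops))
```

Consistency across similar inputs could be probed analogously by comparing the explanation rankings produced for perturbed copies of the same trial.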

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes PhiNet as a network architecture for speaker verification that incorporates phonetic interpretability, motivated by forensic practices. Its central claims rest on empirical evaluations across VoxCeleb, SITW, and LibriSpeech, reporting performance parity with black-box ASV models plus qualitative/quantitative interpretability assessments. No equations, parameter-fitting steps, or self-citation chains are visible that would reduce any prediction or uniqueness claim back to the inputs by construction. The derivation is therefore self-contained as an architectural and experimental contribution rather than a deductive loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The central claim implicitly assumes that phonetic features can be extracted and aligned in a way that preserves verification accuracy, but no concrete ledger entries can be extracted.

pith-pipeline@v0.9.0 · 5487 in / 1074 out tokens · 27697 ms · 2026-05-13T21:09:00.418346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
