pith. machine review for the scientific record.

arxiv: 2604.01590 · v2 · submitted 2026-04-02 · 📡 eess.AS · cs.SD

Recognition: 1 theorem link · Lean Theorem

PhiNet: Speaker Verification with Phonetic Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:09 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords speaker verification · phonetic interpretability · forensic speaker comparison · automatic speaker verification · model interpretability · PhiNet · VoxCeleb

The pith

PhiNet adds phonetic-level explanations to speaker verification while matching the accuracy of black-box models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhiNet as a speaker verification network that incorporates phonetic evidence to generate interpretable decisions. This design lets users examine speaker-specific phonetic features for manual review and gives developers explicit reasoning for each outcome. The approach draws from human forensic speaker comparison practices to improve transparency in high-stakes applications. Experiments on VoxCeleb, SITW, and LibriSpeech show that the added interpretability does not degrade verification performance relative to standard models.

Core claim

PhiNet enhances local and global interpretability by leveraging phonetic evidence in decision-making, supplying detailed phonetic-level comparisons that support manual inspection of speaker-specific features and explicit reasoning for verification outcomes, while delivering performance comparable to traditional black-box ASV models across VoxCeleb, SITW, and LibriSpeech.

What carries the argument

PhiNet, a speaker verification network that integrates phonetic evidence into its decision process to produce both local phonetic comparisons and global reasoning traces.
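
The review does not reproduce PhiNet's scoring machinery, but the general shape of a phoneme-by-phoneme comparison that doubles as a local explanation can be sketched as below. The phoneme inventory, trait dimensionality, cosine similarity, and uniform weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of phoneme-level comparison for speaker verification.
# Assumption: each utterance has already been decomposed into per-phoneme
# "trait" vectors (e.g. by a phonetic trait extractor); names and shapes
# here are illustrative, not the paper's actual design.
import numpy as np

PHONEMES = ["AA", "IY", "S", "N", "T"]   # toy inventory
TRAIT_DIM = 16

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def phonetic_verification(enroll, test, weights):
    """Score a trial and return a per-phoneme breakdown (the 'local' explanation).

    enroll/test map phoneme -> trait vector; weights map phoneme -> importance.
    Phonemes missing from either utterance are skipped, one plausible way to
    handle partial phonetic coverage.
    """
    per_phoneme = {}
    num = den = 0.0
    for ph in PHONEMES:
        if ph in enroll and ph in test:
            s = cosine(enroll[ph], test[ph])
            per_phoneme[ph] = s
            w = weights.get(ph, 1.0)
            num += w * s
            den += w
    return num / max(den, 1e-9), per_phoneme

# Toy usage: a "test" utterance built as a noisy copy of the enrollment traits.
rng = np.random.default_rng(0)
enroll = {ph: rng.normal(size=TRAIT_DIM) for ph in PHONEMES}
test = {ph: v + 0.1 * rng.normal(size=TRAIT_DIM) for ph, v in enroll.items()}
score, breakdown = phonetic_verification(enroll, test, {ph: 1.0 for ph in PHONEMES})
print(round(score, 3), {ph: round(s, 2) for ph, s in breakdown.items()})
```

The per-phoneme breakdown is what would be surfaced for manual inspection; a global explanation would instead summarize the learned weights across many trials.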

If this is right

  • Verification decisions become traceable at the phonetic level, allowing direct manual checks of speaker features.
  • Error analysis and hyperparameter tuning become simpler because the network supplies explicit phonetic reasoning.
  • The system supports forensic applications by aligning automatic outputs with human comparison methods.
  • Performance remains competitive with black-box models on standard benchmarks including VoxCeleb and SITW.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the phonetic features prove stable across languages, the approach could support cross-lingual forensic verification.
  • Integrating PhiNet-style explanations into other audio tasks might improve accountability in voice-based security systems.
  • Controlled user studies comparing PhiNet outputs to traditional forensic reports could quantify gains in decision reliability.

Load-bearing premise

Phonetic evidence extracted by the network genuinely captures speaker-specific traits that human experts can use for forensic inspection without introducing new biases or accuracy loss.

What would settle it

A test set where PhiNet's verification error rate rises above black-box baselines or where human forensic analysts rate the phonetic explanations as uninformative or inconsistent with their own judgments.
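
The error-rate side of that test would be read off the standard ASV metric, the equal error rate (EER). A minimal sketch of how EER is computed from a set of trial scores, with synthetic score distributions standing in for real system outputs:

```python
# Equal error rate (EER): the operating point where the false-acceptance and
# false-rejection rates cross, found by sweeping all candidate thresholds.
# The score arrays below are synthetic placeholders, not reported results.
import numpy as np

def eer(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejections
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

rng = np.random.default_rng(1)
target = rng.normal(1.0, 0.5, 2000)      # synthetic same-speaker trial scores
nontarget = rng.normal(0.0, 0.5, 2000)   # synthetic different-speaker trial scores
print(f"EER ≈ {100 * eer(target, nontarget):.2f}%")
```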

Figures

Figures reproduced from arXiv: 2604.01590 by Haizhou Li, Shuai Wang, Tianchi Liu, Yi Ma.

Figure 1. Block diagram of the proposed framework for speaker verification with phonetic interpretability.
Figure 2. Block diagram of the phonetic trait extractor.
Figure 4. Visualization of the decision-making process for a non-target trial (top) and a target trial (bottom). Phonetic boundaries are marked by dotted lines.
Figure 5. Phonetic weight distribution across different phonemes. The results shown are obtained using the model trained under System (10) in Table II.
Figure 6. Phonetic weights for models trained with various durations. The configurations of these models are the same as Systems (4), (6), (10), (12), and (14).
Figure 7. Results of leave-ith-phoneme-out experiments on SITW-eval, Vox1-O, and LibriSpeech. Experiments in which each phoneme is left out of the input spectrogram are shown as “spec-sitw-eval”, “spec-vox1-O”, and “spec-libriSpeech”; correspondingly, “trait” means the phonetic trait of each phoneme is left out, and “baseline” shows the EER of the network without leaving anything out.
Figure 8. Similarity heatmaps between the individual phonetic traits and the …
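
Figure 7's leave-ith-phoneme-out protocol removes one phoneme's contribution at a time and tracks how the EER moves. A toy sketch of that protocol, using synthetic speakers and a stand-in cosine scoring rule rather than the paper's trained network (all names and data below are illustrative):

```python
# Sketch of a leave-one-phoneme-out ablation (cf. Figure 7): re-score every
# trial with one phoneme excluded and compare the resulting EER to a baseline.
import numpy as np

rng = np.random.default_rng(2)
PHONEMES = ["AA", "IY", "S", "N", "T"]   # toy inventory
DIM = 8

def utterance(speaker, noise=0.4):
    """A toy utterance: the speaker's per-phoneme traits plus noise."""
    return {ph: speaker[ph] + noise * rng.normal(size=DIM) for ph in PHONEMES}

def score(enr, tst, exclude=frozenset()):
    """Mean cosine similarity over phonemes, with some phonemes excluded."""
    sims = [float(enr[p] @ tst[p] / (np.linalg.norm(enr[p]) * np.linalg.norm(tst[p])))
            for p in PHONEMES if p not in exclude]
    return float(np.mean(sims))

def eer(tgt, non):
    ts = np.sort(np.concatenate([tgt, non]))
    far = np.array([(non >= t).mean() for t in ts])
    frr = np.array([(tgt < t).mean() for t in ts])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

# Two synthetic speakers; target trials pair utterances of speaker 0,
# non-target trials pair speaker 0 against speaker 1.
spk = [{ph: rng.normal(size=DIM) for ph in PHONEMES} for _ in range(2)]
tgt_trials = [(utterance(spk[0]), utterance(spk[0])) for _ in range(300)]
non_trials = [(utterance(spk[0]), utterance(spk[1])) for _ in range(300)]

def run(exclude=frozenset()):
    return eer(np.array([score(a, b, exclude) for a, b in tgt_trials]),
               np.array([score(a, b, exclude) for a, b in non_trials]))

baseline = run()
for ph in PHONEMES:
    print(ph, f"delta EER = {100 * (run({ph}) - baseline):+.2f} pp")
```

Under this toy setup the deltas are small and roughly uniform; Figure 7 is the real measurement of which phonemes the trained network actually relies on.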
read the original abstract

Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet's interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PhiNet, a neural network for automatic speaker verification (ASV) that integrates phonetic interpretability. Motivated by forensic speaker comparison practices, PhiNet leverages phonetic evidence to generate local explanations (detailed phonetic-level comparisons for manual inspection of speaker-specific features) and global explanations (explicit reasoning for verification decisions and hyperparameter impact). Experiments on VoxCeleb, SITW, and LibriSpeech are reported to demonstrate performance comparable to traditional black-box ASV models, supported by qualitative practical examples and both qualitative and quantitative evaluations of the interpretability methods.

Significance. If the reported results hold, this work could meaningfully advance accountable ASV systems by bridging them with forensic analysis through usable phonetic explanations. The dual emphasis on user-facing manual inspection and developer-facing error tracing is a clear strength, and the multi-benchmark evaluation plus hyperparameter analysis examples add practical value. The significance hinges on whether the phonetic components deliver genuine forensic utility without hidden accuracy costs or new biases.

major comments (2)
  1. §4 (Experiments): The central claim of 'performance comparable to traditional black-box ASV models' requires explicit reporting of metrics such as EER or min t-DCF, baseline comparisons (e.g., x-vector or ECAPA-TDNN), error bars, and ablation results isolating the phonetic module; without these, the empirical parity claim cannot be verified, even though it is load-bearing for the paper's contribution.
  2. §5 (Interpretability Evaluation): The assertion that explanations are 'meaningful' and usable for forensic inspection rests on qualitative examples and architectural choice; quantitative support such as fidelity scores, consistency metrics, or a small user study with forensic experts is needed to substantiate actionability and rule out introduced biases.

minor comments (2)
  1. Abstract: Consider adding one sentence specifying how phonetic information is extracted or injected (e.g., phoneme posterior features or an auxiliary phonetic loss) to improve immediate clarity.
  2. Introduction (notation): Ensure that all acronyms (ASV, FSC) are defined on first use and clarify any new symbols introduced for phonetic embeddings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review and for highlighting areas where the empirical and interpretability claims can be strengthened. We address each major comment below and commit to revisions that will make the supporting evidence explicit and verifiable.

read point-by-point responses
  1. Referee: §4 (Experiments): The central claim of 'performance comparable to traditional black-box ASV models' requires explicit reporting of metrics such as EER or min t-DCF, baseline comparisons (e.g., x-vector or ECAPA-TDNN), error bars, and ablation results isolating the phonetic module; without these, the empirical parity claim cannot be verified, even though it is load-bearing for the paper's contribution.

    Authors: We agree that the current presentation of results is insufficient to substantiate the comparability claim. The revised manuscript will expand §4 with tables reporting EER and min t-DCF on VoxCeleb, SITW, and LibriSpeech, direct comparisons against x-vector and ECAPA-TDNN baselines, standard deviations across runs as error bars, and ablation studies that isolate the phonetic module's contribution. These additions will allow independent verification of performance parity. revision: yes

  2. Referee: §5 (Interpretability Evaluation): The assertion that explanations are 'meaningful' and usable for forensic inspection rests on qualitative examples and architectural choice; quantitative support such as fidelity scores, consistency metrics, or a small user study with forensic experts is needed to substantiate actionability and rule out introduced biases.

    Authors: We acknowledge the need for stronger quantitative grounding. The revised §5 will add fidelity scores measuring alignment between explanations and model decisions, consistency metrics across similar inputs, and explicit discussion of potential biases. We will also clarify the scope of the existing quantitative evaluations already present in the manuscript and, where feasible, include a small-scale expert review; otherwise we will state the current limitations transparently. revision: partial
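
One common way to operationalize the promised fidelity score (hedged: the authors' own metric may differ) is a deletion test: remove the phonemes an explanation ranks highest and check that the verification score drops more than it does under random deletions. A sketch, where `score_fn` is a hypothetical stand-in for a PhiNet-style scorer:

```python
# Deletion-style fidelity check for a phonetic explanation. `score_fn` and
# the explanation format are hypothetical stand-ins, not the paper's metric.
import numpy as np

def deletion_fidelity(score_fn, explanation, trial, k=2, n_random=20, seed=0):
    """score_fn(trial, exclude) -> float; explanation maps phoneme -> weight.

    Returns (drop when the top-k explained phonemes are removed) minus
    (average drop for random k-phoneme removals); a positive value means the
    explanation's top phonemes matter more to the decision than random ones.
    """
    rng = np.random.default_rng(seed)
    phonemes = list(explanation)
    base = score_fn(trial, exclude=set())
    top_k = sorted(phonemes, key=explanation.get, reverse=True)[:k]
    drop_top = base - score_fn(trial, exclude=set(top_k))
    random_drops = [
        base - score_fn(trial, exclude=set(rng.choice(phonemes, size=k, replace=False)))
        for _ in range(n_random)
    ]
    return drop_top - float(np.mean(random_drops))
```

Consistency across similar inputs could be probed analogously by comparing the explanation rankings produced for perturbed copies of the same trial.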

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes PhiNet as a network architecture for speaker verification that incorporates phonetic interpretability, motivated by forensic practices. Its central claims rest on empirical evaluations across VoxCeleb, SITW, and LibriSpeech, reporting performance parity with black-box ASV models plus qualitative/quantitative interpretability assessments. No equations, parameter-fitting steps, or self-citation chains are visible that would reduce any prediction or uniqueness claim back to the inputs by construction. The derivation is therefore self-contained as an architectural and experimental contribution rather than a deductive loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The central claim implicitly assumes that phonetic features can be extracted and aligned in a way that preserves verification accuracy, but no concrete ledger entries can be extracted.

pith-pipeline@v0.9.0 · 5487 in / 1074 out tokens · 27697 ms · 2026-05-13T21:09:00.418346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
