PhiNet: Speaker Verification with Phonetic Interpretability
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-13 21:09 UTC · model grok-4.3
The pith
PhiNet adds phonetic-level explanations to speaker verification while matching the accuracy of black-box models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhiNet enhances local and global interpretability by leveraging phonetic evidence in decision-making, supplying detailed phonetic-level comparisons that support manual inspection of speaker-specific features and explicit reasoning for verification outcomes, while delivering performance comparable to traditional black-box ASV models across VoxCeleb, SITW, and LibriSpeech.
What carries the argument
PhiNet, a speaker verification network that integrates phonetic evidence into its decision process to produce both local phonetic comparisons and global reasoning traces.
If this is right
- Verification decisions become traceable at the phonetic level, allowing direct manual checks of speaker features.
- Error analysis and hyperparameter tuning become simpler because the network supplies explicit phonetic reasoning.
- The system supports forensic applications by aligning automatic outputs with human comparison methods.
- Performance remains competitive with black-box models on standard benchmarks including VoxCeleb and SITW.
Where Pith is reading between the lines
- If the phonetic features prove stable across languages, the approach could support cross-lingual forensic verification.
- Integrating PhiNet-style explanations into other audio tasks might improve accountability in voice-based security systems.
- Controlled user studies comparing PhiNet outputs to traditional forensic reports could quantify gains in decision reliability.
Load-bearing premise
Phonetic evidence extracted by the network genuinely captures speaker-specific traits that human experts can use for forensic inspection without introducing new biases or accuracy loss.
What would settle it
A test set where PhiNet's verification error rate rises above black-box baselines or where human forensic analysts rate the phonetic explanations as uninformative or inconsistent with their own judgments.
Original abstract
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet's interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PhiNet, a neural network for automatic speaker verification (ASV) that integrates phonetic interpretability. Motivated by forensic speaker comparison practices, PhiNet leverages phonetic evidence to generate local explanations (detailed phonetic-level comparisons for manual inspection of speaker-specific features) and global explanations (explicit reasoning for verification decisions and hyperparameter impact). Experiments on VoxCeleb, SITW, and LibriSpeech are reported to demonstrate performance comparable to traditional black-box ASV models, supported by qualitative practical examples and both qualitative and quantitative evaluations of the interpretability methods.
Significance. If the reported results hold, this work could meaningfully advance accountable ASV systems by bridging them with forensic analysis through usable phonetic explanations. The dual emphasis on user-facing manual inspection and developer-facing error tracing is a clear strength, and the multi-benchmark evaluation plus hyperparameter analysis examples add practical value. The significance hinges on whether the phonetic components deliver genuine forensic utility without hidden accuracy costs or new biases.
major comments (2)
- [§4] Experiments: The central claim of 'performance comparable to traditional black-box ASV models' requires explicit reporting of metrics such as EER or min t-DCF, baseline comparisons (e.g., x-vector or ECAPA-TDNN), error bars, and ablation results isolating the phonetic module. Without these, the empirical parity cannot be verified, and it is load-bearing for the paper's contribution.
- [§5] Interpretability Evaluation: The assertion that explanations are 'meaningful' and usable for forensic inspection rests on qualitative examples and architectural choice. Quantitative support, such as fidelity scores, consistency metrics, or a small user study with forensic experts, is needed to substantiate actionability and rule out introduced biases.
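The EER the referee asks for is straightforward to compute from trial scores. The sketch below is a minimal illustration of that metric, not code from the paper; the function name and the threshold-sweep strategy are the editor's own.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Hypothetical helper: EER from verification scores.

    scores: similarity score per trial (higher = more likely same speaker)
    labels: 1 for target (same-speaker) trials, 0 for impostor trials

    Sweeps every observed score as a threshold and returns the operating
    point where the false-accept rate (FAR) and false-reject rate (FRR)
    are closest, reporting their average as the EER.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best = (2.0, None)  # (|FAR - FRR|, candidate EER)
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # impostors wrongly accepted
        frr = np.mean(~accept[labels == 1])   # targets wrongly rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]
```

A perfectly separated score set yields an EER of 0; overlapping target and impostor scores push it toward 0.5 (chance).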
minor comments (2)
- [Abstract] Consider adding one sentence specifying how phonetic information is extracted or injected (e.g., phoneme posterior features or an auxiliary phonetic loss) to improve immediate clarity.
- [Introduction] Notation: Define all acronyms (ASV, FSC) consistently on first use and clarify any new symbols introduced for phonetic embeddings.
Simulated Author's Rebuttal
Thank you for your constructive review and for highlighting areas where the empirical and interpretability claims can be strengthened. We address each major comment below and commit to revisions that will make the supporting evidence explicit and verifiable.
Point-by-point responses
-
Referee: [§4] Experiments: The central claim of 'performance comparable to traditional black-box ASV models' requires explicit reporting of metrics such as EER or min t-DCF, baseline comparisons (e.g., x-vector or ECAPA-TDNN), error bars, and ablation results isolating the phonetic module; without these, the empirical parity cannot be verified, and it is load-bearing for the paper's contribution.
Authors: We agree that the current presentation of results is insufficient to substantiate the comparability claim. The revised manuscript will expand §4 with tables reporting EER and min t-DCF on VoxCeleb, SITW, and LibriSpeech; direct comparisons against x-vector and ECAPA-TDNN baselines; standard deviations across runs as error bars; and ablation studies isolating the phonetic module's contribution. These additions will allow independent verification of performance parity. revision: yes
-
Referee: [§5] Interpretability Evaluation: The assertion that explanations are 'meaningful' and usable for forensic inspection rests on qualitative examples and architectural choice; quantitative support, such as fidelity scores, consistency metrics, or a small user study with forensic experts, is needed to substantiate actionability and rule out introduced biases.
Authors: We acknowledge the need for stronger quantitative grounding. The revised §5 will add fidelity scores measuring alignment between explanations and model decisions, consistency metrics across similar inputs, and an explicit discussion of potential biases. We will also clarify the scope of the quantitative evaluations already present in the manuscript and, where feasible, include a small-scale expert review; otherwise we will state the current limitations transparently. revision: partial
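One common form of the fidelity score the authors promise is decision agreement: how often the explanation-level aggregate reproduces the full model's accept/reject decision. The sketch below is an illustrative metric under that assumption; the function name, argument shapes, and threshold are hypothetical, not taken from the paper.

```python
import numpy as np

def explanation_fidelity(phonetic_scores, weights, model_decisions, threshold=0.0):
    """Hypothetical fidelity metric for phonetic explanations.

    phonetic_scores: (n_trials, n_phonemes) per-phoneme similarity scores
                     offered as the explanation for each trial
    weights:         (n_phonemes,) importance weights over phonemes
    model_decisions: (n_trials,) 0/1 accept/reject decisions of the full model

    Returns the fraction of trials where thresholding the weighted sum of
    per-phoneme similarities matches the model's decision.
    """
    agg = np.asarray(phonetic_scores, dtype=float) @ np.asarray(weights, dtype=float)
    explained = (agg >= threshold).astype(int)
    return float(np.mean(explained == np.asarray(model_decisions, dtype=int)))
```

A fidelity near 1.0 would support the claim that the phonetic comparisons actually drive the verification outcome rather than decorating it.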
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper proposes PhiNet as a network architecture for speaker verification that incorporates phonetic interpretability, motivated by forensic practices. Its central claims rest on empirical evaluations across VoxCeleb, SITW, and LibriSpeech, reporting performance parity with black-box ASV models plus qualitative/quantitative interpretability assessments. No equations, parameter-fitting steps, or self-citation chains are visible that would reduce any prediction or uniqueness claim back to the inputs by construction. The derivation is therefore self-contained as an architectural and experimental contribution rather than a deductive loop.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: PhiNet performs speaker verification by comparing enrollment and test utterances via phonetic traits e_i and t_i, cosine similarities s_i, and learned weights w_i to produce a final score y.
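The scoring form described in that passage can be sketched directly: cosine similarities between paired phonetic-trait embeddings, combined by learned weights. This is a minimal illustration of the stated formula y = Σ_i w_i s_i, not the paper's actual implementation; the function name and embedding layout are assumptions.

```python
import numpy as np

def phinet_style_score(enroll_traits, test_traits, weights):
    """Illustrative scoring sketch (assumed, not PhiNet's exact code).

    enroll_traits: list of enrollment embeddings e_i, one per phonetic trait
    test_traits:   list of test embeddings t_i, aligned with enroll_traits
    weights:       learned importance weights w_i

    Computes s_i = cos(e_i, t_i) for each trait, then y = sum_i w_i * s_i.
    """
    sims = []
    for e, t in zip(enroll_traits, test_traits):
        e, t = np.asarray(e, dtype=float), np.asarray(t, dtype=float)
        sims.append(np.dot(e, t) / (np.linalg.norm(e) * np.linalg.norm(t)))
    return float(np.dot(weights, sims))
```

The per-trait similarities s_i are exactly what a local, phonetic-level explanation would expose for manual inspection, while the weights w_i carry the global reasoning about which traits matter.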
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] R. Auckenthaler, E. Parris, and M. Carey, "Improving a GMM speaker verification system by phonetic weighting," in Proc. IEEE ICASSP, vol. 1, 1999, pp. 313–316.
[2] P. Kenny, M. Mihoubi, and P. Dumouchel, "New MAP estimators for speaker recognition," in Proc. Interspeech, 2003, pp. 2961–2964.
[3] E. S. Parris and M. J. Carey, "Discriminative phonemes for speaker identification," in Proc. 3rd International Conference on Spoken Language Processing (ICSLP), 1994, pp. 1843–1846.
[4] S. S. Kajarekar and H. Hermansky, "Speaker verification based on broad phonetic categories," in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2001, pp. 201–206.
[5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
[7] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[8] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech, 2020, pp. 3830–3834.
[9] P. Foulkes and P. French, "Forensic speaker comparison: A linguistic–acoustic perspective," in The Oxford Handbook of Language and Law. Oxford University Press, 2012.
[10] F. Nolan, The Phonetic Bases of Speaker Recognition. Cambridge University Press, 1983.
[11] G. S. Morrison and E. Enzinger, "Introduction to forensic voice comparison," in The Routledge Handbook of Phonetics. Routledge, 2019, pp. 599–634.
[12] F. Beritelli and A. Spadaccini, "The role of voice activity detection in forensic speaker verification," in Proc. 17th International Conference on Digital Signal Processing (DSP), 2011, pp. 1–6.
[13] Y. Tian, L. He, M. Cai, W.-Q. Zhang, and J. Liu, "Deep neural networks based speaker modeling at different levels of phonetic granularity," in Proc. IEEE ICASSP, 2017, pp. 5440–5444.
[14] T. C. Nagavi, P. Mahesha et al., "Voice comparison approaches for forensic application: A review," in Proc. Third International Conference on Secure Cyber Computing and Communication (ICSCCC), 2023, pp. 797–802.
[15] H. Wang, S. Zheng, Y. Chen, L. Cheng, and Q. Chen, "CAM++: A fast and efficient network for speaker verification using context-aware masking," in Proc. Interspeech, 2023, pp. 5301–5305.
[16] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins et al., "Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Information Fusion, vol. 58, pp. 82–115, 2020.
[17] L. G. Kersta, "Voiceprint identification," Nature, vol. 196, no. 4861, p. 1253, 1962.
[18] R. H. Bolt, F. S. Cooper, E. E. David, P. B. Denes, J. M. Pickett, and K. N. Stevens, "Speaker identification by speech spectrograms: Some further observations," The Journal of the Acoustical Society of America, vol. 54, no. 2, pp. 531–534, 1973.
[19] H. Hollien, The Acoustics of Crime: The New Science of Forensic Phonetics. Springer Science & Business Media, 2013.
[20] P. Rose, Forensic Speaker Identification. CRC Press, 2002.
[21] A. Hirson, P. French, and D. Howard, "Speech fundamental frequency over the telephone and face-to-face: Some implications for forensic phonetics," in Studies in General and English Phonetics. Routledge, 2012, pp. 230–240.
[22] T. Hudson, G. De Jong, K. McDougall, P. Harrison, and F. Nolan, "F0 statistics for 100 young male speakers of standard southern British English," in Proc. 16th International Congress of Phonetic Sciences, vol. 6, no. 10, 2007.
[23] A. Braun, "Fundamental frequency: How speaker-specific is it?" Beiträge zur Phonetik und Linguistik, vol. 64, pp. 9–23, 1995.
[24] Z. Yang, M. Du, R. Su, X. Liu, N. Yan, and L. Wang, "A phone-level speaker embedding extraction framework with multi-gate mixture-of-experts based multi-task learning," in Proc. 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2022, pp. 240–244.
[25] Y. Zhang, P. Tiňo, A. Leonardis, and K. Tang, "A survey on neural network interpretability," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 5, no. 5, pp. 726–742, 2021.
[26] W. Saeed and C. Omlin, "Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities," Knowledge-Based Systems, vol. 263, p. 110273, 2023.
[27] Y. Wang and X. Wang, "Self-interpretable model with transformation equivariant interpretation," Advances in Neural Information Processing Systems, vol. 34, pp. 2359–2372, 2021.
[28] D. Alvarez Melis and T. Jaakkola, "Towards robust interpretability with self-explaining neural networks," Advances in Neural Information Processing Systems, vol. 31, 2018.
[29] M. Craven and J. Shavlik, "Extracting tree-structured representations of trained networks," Advances in Neural Information Processing Systems, vol. 8, 1995.
[30] R. Andrews, J. Diederich, and A. B. Tickle, "Survey and critique of techniques for extracting rules from trained artificial neural networks," Knowledge-Based Systems, vol. 8, no. 6, pp. 373–389, 1995.
[31] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," in Proc. ICLR, 2014.
[32] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE ICCV, 2017, pp. 618–626.
[33] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Proc. ICML, 2017, pp. 3319–3328.
[34] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, "Score-CAM: Score-weighted visual explanations for convolutional neural networks," in Proc. IEEE/CVF CVPR Workshops, 2020, pp. 24–25.
[35] P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y. Wei, "LayerCAM: Exploring hierarchical class activation maps for localization," IEEE Transactions on Image Processing, vol. 30, pp. 5875–5888, 2021.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?" Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD, 2016, pp. 1135–1144.
[37] L. S. Shapley, "A value for n-person games," Contributions to the Theory of Games, vol. 2, 1953.
[38] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," Advances in Neural Information Processing Systems, vol. 30, 2017.
[39] V. Petsiuk, "RISE: Randomized input sampling for explanation of black-box models," arXiv preprint arXiv:1806.07421, 2018.
[40] R. Fong, M. Patrick, and A. Vedaldi, "Understanding deep networks via extremal perturbations and smooth masks," in Proc. IEEE/CVF ICCV, 2019, pp. 2950–2958.
[41] F. Wang, H. Liu, and J. Cheng, "Visualizing deep neural network by alternately image blurring and deblurring," Neural Networks, vol. 97, pp. 162–172, 2018.
[42] Q. Zhang, Y. N. Wu, and S.-C. Zhu, "Interpretable convolutional neural networks," in Proc. IEEE CVPR, 2018, pp. 8827–8836.
[43] R. Fong and A. Vedaldi, "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks," in Proc. IEEE CVPR, 2018, pp. 8730–8738.
[44] F. Dalvi, N. Durrani, H. Sajjad, Y. Belinkov, A. Bau, and J. Glass, "What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models," in Proc. AAAI, vol. 33, no. 01, 2019, pp. 6309–6317.
[45] C.-K. Yeh, J. Kim, I. E.-H. Yen, and P. K. Ravikumar, "Representer point selection for explaining deep neural networks," Advances in Neural Information Processing Systems, vol. 31, 2018.
[46] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," in Proc. ICML, 2017, pp. 1885–1894.
[47] O. Li, H. Liu, C. Chen, and C. Rudin, "Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions," in Proc. AAAI, vol. 32, no. 1, 2018.
[48] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, "This looks like that: Deep learning for interpretable image recognition," Advances in Neural Information Processing Systems, vol. 32, 2019.
[49] D. Rymarczyk, Ł. Struski, M. Górszczak, K. Lewandowska, J. Tabor, and B. Zieliński, "Interpretable image classification with differentiable prototypes assignment," in Proc. ECCV, 2022, pp. 351–368.
[50] J. Donnelly, A. J. Barnett, and C. Chen, "Deformable ProtoPNet: An interpretable image classifier using deformable prototypes," in Proc. IEEE/CVF CVPR, 2022, pp. 10265–10275.
[51] M. Nauta, J. Schlötterer, M. Van Keulen, and C. Seifert, "PIP-Net: Patch-based intuitive prototypes for interpretable image classification," in Proc. IEEE/CVF CVPR, 2023, pp. 2744–2753.
[52] E. Kim, S. Kim, M. Seo, and S. Yoon, "XProtoNet: Diagnosis in chest radiography with global and local explanations," in Proc. IEEE/CVF CVPR, 2021, pp. 15719–15728.
[53] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang, "Concept bottleneck models," in Proc. ICML, 2020, pp. 5338–5348.
[54] Y. Yang, A. Panagopoulou, S. Zhou, D. Jin, C. Callison-Burch, and M. Yatskar, "Language in a bottle: Language model guided concept bottlenecks for interpretable image classification," in Proc. IEEE/CVF CVPR, 2023, pp. 19187–19197.
[55] M. Wojtas and K. Chen, "Feature importance ranking for deep learning," Advances in Neural Information Processing Systems, vol. 33, pp. 5105–5114, 2020.
[56] E. Weinberger, J. Janizek, and S.-I. Lee, "Learning deep attribution priors based on prior knowledge," Advances in Neural Information Processing Systems, vol. 33, pp. 14034–14045, 2020.
[57] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt, "Interpretability in the wild: A circuit for indirect object identification in GPT-2 small," arXiv preprint arXiv:2211.00593, 2022.
[58] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, "Progress measures for grokking via mechanistic interpretability," arXiv preprint arXiv:2301.05217, 2023.
[59] H. Shah, A. Ilyas, and A. Madry, "Decomposing and editing predictions by modeling model computation," arXiv preprint arXiv:2404.11534, 2024.
[60] M. T. Ribeiro, S. Singh, and C. Guestrin, "Anchors: High-precision model-agnostic explanations," in Proc. AAAI, vol. 32, no. 1, 2018.
[61] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das, "Explanations based on the missing: Towards contrastive explanations with pertinent negatives," Advances in Neural Information Processing Systems, vol. 31, 2018.
[62] T. Pedapati, A. Balakrishnan, K. Shanmugam, and A. Dhurandhar, "Learning global transparent models consistent with local contrastive explanations," Advances in Neural Information Processing Systems, vol. 33, pp. 3592–3602, 2020.
[63] P. Li, L. Li, A. Hamdulla, and D. Wang, "Reliable visualization for deep speaker recognition," arXiv preprint arXiv:2204.03852, 2022.
[64] P. Li, L. Li, A. Hamdulla, and D. Wang, "Visualizing data augmentation in deep speaker recognition," in Proc. Interspeech, 2023, pp. 2243–2247.
[65] P. Li, T. Wang, L. Li, A. Hamdulla, and D. Wang, "How phonemes contribute to deep speaker models?" in Proc. IEEE ICASSP Workshops (ICASSPW), 2024, pp. 838–842.
[66] J. Zhang, L. He, X. Guo, and J. Ma, "A study on visualization of voiceprint feature," in Proc. Interspeech, 2023, pp. 2233–2237.
[67] T. Thebaud, G. Sierra, S. Juan, and M. Tahon, "A phonetic analysis of speaker verification systems through phoneme selection and integrated gradients," in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2024.
[68] I. Ben-Amor, J.-F. Bonastre, B. O'Brien, and P.-M. Bousquet, "Describing the phonetics in the underlying speech attributes for deep and interpretable speaker recognition," in Proc. Interspeech, 2023.
[69] Y. Ma, S. Wang, T. Liu, and H. Li, "ExPO: Explainable phonetic trait-oriented network for speaker verification," IEEE Signal Processing Letters, 2025.
[70] X. Wu, C. Luu, P. Bell, and A. Rajan, "Explainable attribute-based speaker verification," arXiv preprint arXiv:2405.19796, 2024.
[71] J. Zhu, C. Zhang, and D. Jurgens, "Phone-to-audio alignment without text: A semi-supervised approach," in Proc. IEEE ICASSP, 2022.
[72] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017, pp. 2616–2620.
[73] J. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018, pp. 1086–1090.
[74] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, "The speakers in the wild (SITW) speaker recognition database," in Proc. Interspeech, 2016, pp. 818–822.
[75] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
[76] The CMU Pronouncing Dictionary. Accessed: March 5, 2024. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[77] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, "WeSpeaker: A research and production oriented speaker embedding learning toolkit," in Proc. IEEE ICASSP, 2023, pp. 1–5.
[78] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[79] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. IEEE ICASSP, 2017, pp. 5220–5224.
[80] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin, "Deep metric learning with angular loss," in Proc. IEEE ICCV, 2017, pp. 2593–2601.