pith. sign in

arxiv: 2606.10912 · v1 · pith:3AKKHCLRnew · submitted 2026-06-09 · 💻 cs.SD · cs.AI· cs.CR· cs.LG

What Do Deepfake Speech Detectors Actually Hear?

Pith reviewed 2026-06-27 11:36 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.CRcs.LG
keywords deepfake detectionexplainable AIIntegrated Gradientsspeech deepfakesaudio analysisWavLMcausal interventionASVspoof
0
0 comments X

The pith

Deepfake speech detectors rely on distinct cues despite similar performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops an explainability method for deepfake speech detectors that uses Integrated Gradients applied to time-aligned self-supervised representations to identify which parts of the audio drive the decision. It applies this to three detectors and finds they focus on different aspects of the signal: AASIST on non-speech and environment cues, CA-MHFA on localized phoneme artifacts, and SLS on word boundaries and spectral integrity. The findings are confirmed when masking those regions causes the detectors to perform worse. Readers would care because it reveals the actual evidence used by these models rather than assuming they detect synthesis artifacts directly.

Core claim

The central discovery is that three WavLM-based detectors with similar performance on ASVspoof 5 actually rely on different cues in the audio: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. This is established through an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations, with semantic annotation of high-attribution regions and validation via causal masking experiments that degrade performance.

What carries the argument

An audio-native explainability pipeline applying Integrated Gradients to time-aligned self-supervised representations to localize and annotate decision evidence over time in audio samples.

If this is right

  • Different detectors may be vulnerable to distinct manipulation strategies targeting their specific cues.
  • Benchmark performance alone does not reveal whether detectors are using consistent or meaningful evidence for deepfake identification.
  • Removing the identified cue regions through masking leads to measurable drops in detection accuracy, confirming the role of those cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variation in cues suggests detectors may be picking up on dataset artifacts specific to ASVspoof 5 rather than general deepfake properties.
  • Combining detectors that focus on complementary cues could lead to more robust systems.
  • The pipeline could be extended to other audio tasks to understand model decision processes beyond speech deepfakes.

Load-bearing premise

The Integrated Gradients method applied to time-aligned self-supervised representations correctly identifies the regions containing the cues that the detector uses for its decision, and the manual annotations accurately reflect those cues without significant bias.

What would settle it

Observing no significant performance drop when the highest-attribution regions are masked in the audio inputs would indicate that those regions do not contain the primary decision evidence.

Figures

Figures reproduced from arXiv: 2606.10912 by Anton Firc, Jakub Re\v{s}, Kamil Malinka, Martin Pere\v{s}\'ini, Veronika Jirmusov\'a, Vojt\v{e}ch Stan\v{e}k.

Figure 1
Figure 1. Figure 1: Spoof IG attributions (smoothed) of the three examined detectors for the file E 0005076209 (spoofed recording from the High￾confidence correct predictions category). The highlighted regions represent the primary cue regions identified by one of the annotators. 2. High-confidence errors: To investigate severe model failures, we selected 20 samples where all three detectors were con￾fidently wrong (similarly… view at source ↗
read the original abstract

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 3 minor

Summary. The paper introduces an audio-native explainability pipeline that applies Integrated Gradients to time-aligned WavLM representations to localize decision evidence in three WavLM-based deepfake speech detectors (AASIST, CA-MHFA, SLS) evaluated on ASVspoof 5. It reports that the detectors rely on qualitatively different cues despite similar performance—AASIST on non-speech/environmental signals, CA-MHFA on localized phoneme artifacts, and SLS on word boundaries and spectral integrity—and validates these interpretations via manual annotation of high-attribution regions followed by causal masking that produces performance degradation.

Significance. If the attributions prove faithful and the semantic labels accurate, the work supplies concrete, comparative insight into why otherwise comparable detectors succeed or fail, which could guide architecture choices and robustness improvements. The inclusion of a masking-based validation step is a positive step beyond purely post-hoc speculation.

major comments (4)
  1. [Section 3] Attribution method (Section 3): No quantitative faithfulness checks for the Integrated Gradients attributions are reported (e.g., comparison against ground-truth synthetic artifacts, sanity checks, or alternative explainers such as occlusion or SHAP), which is load-bearing because the entire cue-localization and semantic-interpretation pipeline rests on these attributions being faithful to the models' actual decision evidence.
  2. [Section 4] Cue annotation procedure (Section 4): The manual semantic labeling of highest-attribution segments lacks any reported inter-annotator agreement statistics, annotation guidelines, or blinding protocol; this subjectivity directly affects the central claim that the three detectors specialize in distinct cue types (non-speech vs. phoneme artifacts vs. word boundaries).
  3. [Section 5] Causal masking validation (Section 5): The observed performance drops after masking high-attribution regions could arise from distribution shift or edge artifacts introduced into the shared WavLM front-end rather than from removal of the claimed semantic cues; no control experiments (random masking of equal duration or low-attribution regions) are described to isolate the effect.
  4. [Section 4.3] Model comparison (Section 4.3): Because AASIST, CA-MHFA, and SLS all share the identical WavLM front-end, any systematic differences in how masking interacts with their respective downstream heads could produce the reported degradation pattern without implying the claimed semantic specialization; additional analysis separating front-end from head contributions is needed to support the central claim.
minor comments (3)
  1. [Abstract and Section 3] The abstract and methods should explicitly state the exact WavLM layer(s) used for time-alignment and the precise masking procedure (e.g., zeroing, noise replacement, or interpolation).
  2. [Figures 3-5 and Table 2] Figure captions and tables reporting attribution statistics should include error bars or confidence intervals across multiple seeds or folds.
  3. [Introduction] A brief discussion of related audio explainability methods (e.g., prior work on gradient-based or attention-based explanations for speech models) is missing from the introduction.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. The suggestions identify opportunities to strengthen the validation of our explainability pipeline. We address each major comment below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Section 3] Attribution method (Section 3): No quantitative faithfulness checks for the Integrated Gradients attributions are reported (e.g., comparison against ground-truth synthetic artifacts, sanity checks, or alternative explainers such as occlusion or SHAP), which is load-bearing because the entire cue-localization and semantic-interpretation pipeline rests on these attributions being faithful to the models' actual decision evidence.

    Authors: We agree that quantitative faithfulness metrics would increase confidence in the attributions. In the revised version we will add (i) occlusion-based attribution comparisons on a subset of samples and (ii) sanity checks consisting of model-weight randomization and input-permutation tests, following standard practices in the XAI literature. These additions will be reported in an expanded Section 3. revision: yes

  2. Referee: [Section 4] Cue annotation procedure (Section 4): The manual semantic labeling of highest-attribution segments lacks any reported inter-annotator agreement statistics, annotation guidelines, or blinding protocol; this subjectivity directly affects the central claim that the three detectors specialize in distinct cue types (non-speech vs. phoneme artifacts vs. word boundaries).

    Authors: The annotations were performed by two authors with speech-processing expertise following an internal protocol that we will document in the appendix. We will also report Cohen’s kappa on a 20 % overlap subset in the revision. Because the primary evidence for cue specialization is the causal-masking performance drop rather than the labels themselves, we view the annotation step as interpretive support rather than the sole foundation of the claim. revision: partial

  3. Referee: [Section 5] Causal masking validation (Section 5): The observed performance drops after masking high-attribution regions could arise from distribution shift or edge artifacts introduced into the shared WavLM front-end rather than from removal of the claimed semantic cues; no control experiments (random masking of equal duration or low-attribution regions) are described to isolate the effect.

    Authors: We will add two control conditions in the revised Section 5: (a) random masking of equal total duration and (b) masking of the lowest-attribution regions. These controls will be run on the same evaluation set and reported alongside the original high-attribution masking results to demonstrate specificity. revision: yes

  4. Referee: [Section 4.3] Model comparison (Section 4.3): Because AASIST, CA-MHFA, and SLS all share the identical WavLM front-end, any systematic differences in how masking interacts with their respective downstream heads could produce the reported degradation pattern without implying the claimed semantic specialization; additional analysis separating front-end from head contributions is needed to support the central claim.

    Authors: Because Integrated Gradients is applied to the time-aligned WavLM representations that serve as input to each downstream head, the observed attribution differences already reflect how each head selectively weights the same front-end features. We will add a short analysis that compares attribution statistics before and after the first head layer to further isolate head-specific selection. Full ablation of the front-end would require retraining, which lies outside the scope of the current study. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical application of standard methods on public benchmark.

full rationale

The paper applies Integrated Gradients (a pre-existing attribution technique) to time-aligned WavLM features from three published detectors, performs manual semantic annotation of high-attribution segments, and validates via masking on the ASVspoof 5 benchmark. No equations, fitted parameters, or self-citations are used to derive the reported cue semantics; the central findings are observational and externally falsifiable against the benchmark data. No self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the described derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the faithfulness of Integrated Gradients for localizing decisions in self-supervised audio representations and on the reliability of human semantic labeling of attribution peaks; both are standard domain assumptions in explainable AI rather than new postulates.

axioms (1)
  • domain assumption Integrated Gradients attributions on time-aligned WavLM features correctly identify the audio regions driving detector decisions
    Invoked when the pipeline is introduced and when highest-attribution regions are selected for annotation.

pith-pipeline@v0.9.1-grok · 5713 in / 1374 out tokens · 27752 ms · 2026-06-27T11:36:02.760360+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 1 canonical work pages

  1. [1]

    In re- sponse, deepfake speech detectors are employed as the primary defense [1, 2, 3]

    Introduction High-quality speech deepfakes are now easy to generate. In re- sponse, deepfake speech detectors are employed as the primary defense [1, 2, 3]. A key limitation is that these systems typically output a single score [4, 5], with little insight intowhya record- ing received the score,wherein the utterance the detector finds evidence, orwhattype...

  2. [2]

    Pretrained models like Wav2Vec 2.0 [15, 16] and WavLM [7, 17] are favored for their ability to extract rich speaker representations directly from raw audio

    Background Current deepfake detection systems rely on architectures that use SSL front-ends [14]. Pretrained models like Wav2Vec 2.0 [15, 16] and WavLM [7, 17] are favored for their ability to extract rich speaker representations directly from raw audio. These features are processed most often by AASIST [8, 18, 19], SLS [6, 20], or (CA-)MHFA [10, 11, 21]....

  3. [3]

    Method To interpret the decision-making logic of deepfake detectors, we employ the Integrated Gradients (IG) [13] method. IG enables precise attribution of decision scores, which is crucial for dis- tinguishing whether a model detects genuine synthesis artifacts or merely exploits channel-specific noise. 3.1. Integrated Gradients for SSL Representations G...

  4. [4]

    The annotators can input a gen- eral observation about the recording

    The analyzed recording and its spectrogram are presented to the annotators for inspection. The annotators can input a gen- eral observation about the recording

  5. [5]

    •Primary cue:Annotators identify a primary segment that corresponds to the highest IG attributions

    A series of annotations is gathered for each of the detectors. •Primary cue:Annotators identify a primary segment that corresponds to the highest IG attributions. •Cue type:Each identified cue is assigned a type from a predefined set: local glitch, phoneme content/articulation, voiced-unvoiced transition, silence, breath, channel/codec noise, spectral art...

  6. [6]

    The protocol is designed to both localize the dominant evi- dence driving the detector’s decision and assign an interpretable semantic label to that evidence

    Finally, for the same recording, similarity or disparity be- tween the cues identified by different detectors is assessed. The protocol is designed to both localize the dominant evi- dence driving the detector’s decision and assign an interpretable semantic label to that evidence. Our method is not model- agnostic and focuses on SSL-based models. But, imp...

  7. [7]

    All systems utilize the pre- trained WavLM Base+ model as the front-end feature extractor

    Experimental Setup We evaluate three modern deepfake detection architectures: AASIST [8], Context-Aware MHFA (CA-MHFA) [10], and Sensitive Layer Selection (SLS) [6]. All systems utilize the pre- trained WavLM Base+ model as the front-end feature extractor. To ensure optimal adaptation, we employ a joint training strat- egy in which both the SSL front-end ...

  8. [8]

    s” fricative, CA-MHFA pinpointed a specific “e

    Results Across the 100 examined recordings, we observe two dataset- level patterns: (1) most high-confidence errors are associated with aggressive compression; (2) the attack A28 (pretrained YourTTS [31]) dominates the high-confidence error category. Regarding the examined models, the IG annotations show that the three detectors rely on distinct cues, eff...

  9. [9]

    Conclusion We introduce an audio-native explainability pipeline that moves beyond speculative reasoning and empirically demonstrates which specific cues the detectors focus on. Through rigor- ous manual annotations and causal experimental validation, we identify the primary cues of the examined detectors and assign them semantic meaning: •AASISTacts as an...

  10. [10]

    Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Re- public

    Acknowledgments This work was supported by the Brno University of Technology internal project FIT-S-26-9011. Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Re- public

  11. [11]

    The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

    Generative AI Use Disclosure During the preparation of this work, the authors used Generative AI Models (specifically Google Gemini, ChatGPT, and Gram- marly) for language editing and text refinement. The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

  12. [12]

    ASVspoof 5: crowd- sourced speech data, deepfakes, and adversarial attacks at scale,

    X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “ASVspoof 5: crowd- sourced speech data, deepfakes, and adversarial attacks at scale,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

  13. [13]

    Deepfake speech de- tection: A spectrogram analysis,

    A. Firc, K. Malinka, and P. Han ´aˇcek, “Deepfake speech de- tection: A spectrogram analysis,” inProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, ser. SAC ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 1312–1320

  14. [14]

    Resilience of voice assistants to synthetic speech,

    K. Malinkaet al., “Resilience of voice assistants to synthetic speech,” inComputer Security – ESORICS 2024. Cham: Springer Nature Switzerland, 2024, pp. 66–84

  15. [15]

    Intema system description for the ASVspoof5 Challenge: power weighted score fusion,

    A. Aliyev and A. Kondratev, “Intema system description for the ASVspoof5 Challenge: power weighted score fusion,” inThe Automatic Speaker Verification Spoofing Countermeasures Work- shop (ASVspoof 2024), 2024, pp. 152–157

  16. [16]

    Enhancing spoofing detection in ASVspoof 5 Workshop 2024: fusion of WavLM- ResNet18-SA for optimal performance against speech deepfakes,

    P.-C. Chan, W.-Y . Chen, and J.-C. Wang, “Enhancing spoofing detection in ASVspoof 5 Workshop 2024: fusion of WavLM- ResNet18-SA for optimal performance against speech deepfakes,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024, pp. 158–162

  17. [17]

    Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier,

    Q. Zhang, S. Wen, and T. Hu, “Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier,” inProceedings of the 32nd ACM International Conference on Multimedia, ser. MM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 6765–6773. [Online]. Available: https: //doi.org/10.1145/3664647.3681345

  18. [18]

    Exploring WavLM back-ends for speech spoofing and deepfake detection,

    T. Stourbe, V . Miara, T. Lepage, and R. Dehak, “Exploring WavLM back-ends for speech spoofing and deepfake detection,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024, pp. 72–78

  19. [19]

    AASIST: Audio Anti-Spoofing Us- ing Integrated Spectro-Temporal Graph Attention Networks,

    J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “AASIST: Audio Anti-Spoofing Us- ing Integrated Spectro-Temporal Graph Attention Networks,” in ICASSP 2022 - 2022 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2022, pp. 6367– 6371

  20. [20]

    AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,

    K. Borodin, V . Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y . Rogov, “AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 48–55

  21. [21]

    CA-MHFA: A Context-Aware Multi-Head Fac- torized Attentive Pooling for SSL-Based Speaker Verification,

    J. Peng, L. Mo ˇsner, L. Zhang, O. Plchot, T. Stafylakis, L. Burget, and J. ˇCernock´y, “CA-MHFA: A Context-Aware Multi-Head Fac- torized Attentive Pooling for SSL-Based Speaker Verification,” inICASSP , IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Hyderabad: IEEE Signal Processing Society, 2025, pp. 1–5. [Online...

  22. [22]

    BUT systems and analyses for the ASVspoof 5 Challenge,

    J. Rohdin, L. Zhang, P. Old ˇrich, V . Stanˇek, D. Mihola, J. Peng, T. Stafylakis, D. Beveraki, A. Silnova, J. Brukner, and L. Bur- get, “BUT systems and analyses for the ASVspoof 5 Challenge,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 24–31

  23. [23]

    Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing,

    J. Peng, L. Zhang, J. Han, O. Plchot, J. Rohdin, T. Stafylakis, S. Wang, and J. ˇCernock´y, “Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing,” 2025. [Online]. Available: https://arxiv.org/abs/ 2508.16232

  24. [24]

    Axiomatic attribution for deep networks,

    M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” inProceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, p. 3319–3328

  25. [25]

    Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,

    A. Firc, K. Malinka, and P. Han ´aˇcek, “Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,”Cybersecurity, vol. 8, no. 1, 2025

  26. [26]

    wav2vec 2.0: a framework for self-supervised learning of speech representa- tions,

    A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representa- tions,” inProceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY , USA: Curran Associates Inc., 2020

  27. [27]

    SZU-AFS an- tispoofing system for the ASVspoof 5 Challenge,

    Y . Xu, J. Zhong, S. Zheng, Z. Liu, and B. Li, “SZU-AFS an- tispoofing system for the ASVspoof 5 Challenge,” inThe Auto- matic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 64–71

  28. [28]

    WavLM: Large-Scale Self-Supervised Pre- Training for Full Stack Speech Processing,

    S. Chenet al., “WavLM: Large-Scale Self-Supervised Pre- Training for Full Stack Speech Processing,”IEEE Journal of Se- lected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  29. [29]

    A single end-to-end voice anti-spoofing model with graph attention and feature aggrega- tion for ASVspoof 5 Challenge,

    W. Xia, H. Peng, L. Li, and Y . Ren, “A single end-to-end voice anti-spoofing model with graph attention and feature aggrega- tion for ASVspoof 5 Challenge,” inThe Automatic Speaker Ver- ification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 124–130

  30. [30]

    Diffuse or Confuse: A Dif- fusion Deepfake Speech Dataset,

    A. Firc, K. Malinka, and P. Han ´aˇcek, “Diffuse or Confuse: A Dif- fusion Deepfake Speech Dataset,” in2024 International Confer- ence of the Biometrics Special Interest Group (BIOSIG). IEEE, Sep. 2024, p. 1–7

  31. [31]

    SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,

    V . Stanˇek, K. Srna, A. Firc, and K. Malinka, “SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,” in BIOSIG 2025. Gesellschaft f ¨ur Informatik e.V ., 2025

  32. [32]

    Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,

    V . Stan ˇek, M. Pere ˇs´ıni, L. Sekanina, A. Firc, and K. Malinka, “Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,” 2026. [Online]. Available: https://arxiv.org/abs/2604. 01330

  33. [33]

    ASVspoof 5 Challenge: advanced ResNet architectures for robust voice spoofing detec- tion,

    A.-T. Dao, M. Rouvier, and D. Matrouf, “ASVspoof 5 Challenge: advanced ResNet architectures for robust voice spoofing detec- tion,” inThe Automatic Speaker Verification Spoofing Counter- measures Workshop (ASVspoof 2024), 2024, pp. 163–169

  34. [34]

    Exploring generalization to unseen au- dio data for spoofing: insights from SSL models,

    A. Kulkarni, H. M. Tran, A. Kulkarni, S. Dowerah, D. Lo- live, and M. M. Doss, “Exploring generalization to unseen au- dio data for spoofing: insights from SSL models,” inThe Auto- matic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 86–93

  35. [35]

    Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,

    A. Firc, K. Malinka, and P. Han ´aˇcek, “Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,”Heliyon, vol. 9, no. 4, p. e15090, 2023

  36. [36]

    A Unified Approach to Inter- preting Model Predictions,

    S. M. Lundberg and S.-I. Lee, “A Unified Approach to Inter- preting Model Predictions,” inAdvances in Neural Information Processing Systems, I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Avail- able: https://proceedings.neurips.cc/paper files/paper/2017/...

  37. [37]

    Explaining Deep Learning Models for Spoofing and Deepfake Detection with Shapley Additive Explanations,

    W. Ge, J. Patino, M. Todisco, and N. Evans, “Explaining Deep Learning Models for Spoofing and Deepfake Detection with Shapley Additive Explanations,” inICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), 2022, pp. 6387–6391

  38. [38]

    ”Why should I trust you?

    M. T. Ribeiro, S. Singh, and C. Guestrin, “”Why should I trust you?”: Explaining the predictions of any classifier,” inProceed- ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144

  39. [39]

    Explainable Deepfake and Spoofing Detection: An Attack Analysis Using SHapley Additive exPlanations,

    W. Ge, M. Todisco, and N. Evans, “Explainable Deepfake and Spoofing Detection: An Attack Analysis Using SHapley Additive exPlanations,” inThe Speaker and Language Recognition Work- shop (Odyssey 2022), 2022, pp. 70–76

  40. [40]

    Recommendation ITU- T P.800.2: Mean opinion score interpretation and reporting,

    International Telecommunication Union, “Recommendation ITU- T P.800.2: Mean opinion score interpretation and reporting,” In- ternational Telecommunication Union, Telecommunication Stan- dardization Sector (ITU-T), Geneva, Switzerland, Recommenda- tion P.800.2, May 2013

  41. [41]

    Raw- boost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

    H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Raw- boost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” inICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6382–6386

  42. [42]

    Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone,

    E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. G ¨olge, and M. A. Ponti, “Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone,” inInternational Con- ference on Machine Learning. PMLR, 2022, pp. 2709–2720