pith. machine review for the scientific record. sign in

arxiv: 2605.10153 · v1 · submitted 2026-05-11 · 💻 cs.SD · cs.LG

Recognition: 3 theorem links

· Lean Theorem

APEX: Audio Prototype EXplanations for Classification Tasks

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords audio classificationexplainable AIprototype explanationspost-hoc methodsspectrogramsacoustic similarityinterpretability
0
0 comments X

The pith

APEX generates post-hoc explanations for audio classifiers by disentangling prototypes into four acoustic perspectives without retraining the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces APEX as a framework to interpret decisions from pre-trained audio classification models. It finds representative audio examples as prototypes and separates explanations into four views that focus on different aspects of sound: localization of short events, patterns over time, emphasis on certain frequencies, and combinations of time and frequency. The approach requires no changes to the original model and leaves its predictions unchanged. This matters because audio signals differ from images in fundamental ways, so methods borrowed from vision often miss key acoustic properties and deliver less clear insights.

Core claim

APEX is a post-hoc framework for interpreting pre-trained audio classifiers by generating explanations based on prototypes disentangled into four perspectives: square-based prototypes to localize transient events, time-based prototypes for temporal patterns, frequency-based prototypes highlighting spectral bands, and time-frequency-based prototypes integrating both. These explanations respect acoustic properties, provide greater semantic clarity than standard gradient-based methods, require no fine-tuning of the backbone, and strictly preserve output invariance.

What carries the argument

Disentanglement of acoustic similarity into four prototype perspectives: square-based for transient events, time-based for temporal patterns, frequency-based for spectral bands, and time-frequency-based for integrated analysis.

If this is right

  • Explanations become available for any existing audio classifier without additional training steps.
  • Transient events can be localized while temporal patterns and spectral bands are highlighted separately or together.
  • Model decisions in audio tasks gain semantic clarity that respects the multidimensional nature of sound.
  • Deployment of audio AI systems can proceed with example-based interpretations that maintain original performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The four perspectives might combine into a single interactive interface for users exploring complex audio scenes.
  • Similar disentanglement could apply to other time-series data such as sensor readings or physiological signals.
  • Quantitative metrics for semantic clarity could be developed by measuring human agreement on what the prototypes reveal.

Load-bearing premise

Acoustic similarity can be captured and disentangled into the four proposed prototype perspectives without fine-tuning the backbone model while preserving output invariance.

What would settle it

An experiment on standard audio datasets where the APEX explanations either alter the classifier output or fail to show clearer semantic understanding than gradient-based methods applied to spectrograms.

Figures

Figures reproduced from arXiv: 2605.10153 by Kornel Howil, Mi{\l}osz Adamczyk, Piotr Borycki, Piotr Kawa, Piotr Syga, Przemys{\l}aw Spurek.

Figure 1
Figure 1. Figure 1: Overview of the APEX framework. Unlike traditional prototype-based approaches that require training specialized architectures from scratch, APEX operates in a post-hoc set￾ting, providing interpretability for arbitrary pre-trained audio backbones. The diagram illustrates our four distinct prototype extraction schemes: Square-based, Time-based, Frequency￾based, and Time-Frequency-based. These schemes disent… view at source ↗
Figure 2
Figure 2. Figure 2: Architectural and representational comparison between a standard audio classifier and the post-hoc APEX framework. Top: While a classical backbone produces entangled feature maps, APEX inserts a Disentanglement Module. Applying a learnable invertible transformation U and its inverse U −1 reorganizes the latent space while strictly preserving the original model’s predictions (output invariance). Bottom: The… view at source ↗
Figure 3
Figure 3. Figure 3: Explanations for a Golden-crowned Kinglet (gockin) before and after APEX optimization. Prior to tuning, the extracted prototypes resemble random noise and offer little interpretability. Following APEX optimization, prototypes become semantically meaningful, aligning with the distinct acoustic features present in the input spectrogram. Labels above prototypes show their classes. sic data by operating on sou… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of post-hoc interpretability methods conducted on a representation learned on top of a pretrained ConvNeXt classifier. The input spectrogram (far left) contains distinct correctly classified test sounds from the SNE test set of the BirdSet [26] dataset. APEX framework successfully disentangles the latent space to generate highly localized, semantically clear time-frequency explanatio… view at source ↗
Figure 5
Figure 5. Figure 5: Depiction of the APEX masking strategy used to eval￾uate feature importance. The columns show the original spec￾trogram (left), the APEX explanation with the localized proto￾type region highlighted in green (middle), and the correspond￾ing masked spectrogram (right) for each of the schemes. Optimization via Scheme-Dependent Purity The objec￾tive of the disentanglement process is to maximize the “purity” of… view at source ↗
Figure 6
Figure 6. Figure 6: APEX explanations (ConvNeXt) for correctly classi￾fied real audio from LJSpeech [28]. The left column shows input query spectrograms and heatmaps; the right four show proto￾typical parts and labels. Odd rows display original spectro￾grams, while even rows highlight spectral differences between the real and HiFi-GAN vocoded audio. (Trained on the HiFi￾GAN subset of WaveFake [29]). spectrograms by overlaying… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of explanations between APEX (our) and prototype-based model AudioProtoPNet. The comparison is conducted on a representation learned on top of a pretrained ConvNeXt. This example shows correctly classified test sound of a mouchi (Mountain Chickadee) from the SNE test set of BirdSet [26] dataset. Labels above prototypes show their classes [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
read the original abstract

Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces APEX, a post-hoc framework for interpreting pre-trained audio classifiers. It disentangles explanations into four prototype perspectives—square-based (transient events), time-based (temporal patterns), frequency-based (spectral bands), and time-frequency-based (joint structure)—without fine-tuning the backbone model and while strictly preserving output invariance. The method is positioned as yielding intuitive, example-based explanations that respect acoustic properties and provide greater semantic clarity than standard gradient-based attribution techniques applied to spectrograms.

Significance. If the four-perspective disentanglement can be shown to be faithful to the frozen classifier while respecting acoustic multidimensionality, APEX would address a clear gap in mature XAI methods for audio domains. The post-hoc, no-fine-tuning design and invariance guarantee are strengths that could enable broader adoption for tasks where acoustic similarity is not easily captured by vision-derived gradients.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): The central claim that prototypes can be discovered post-hoc to separately capture transient events, temporal patterns, spectral bands, and joint time-frequency structure—while strictly preserving output invariance and without any fine-tuning—lacks an explicit algorithm, projection step, or invariance proof. Because the backbone latent space may entangle these acoustic dimensions, it is unclear whether the separation is achieved solely on existing representations or requires implicit optimization that would violate the no-fine-tuning guarantee.
  2. [§4] §4 (Experiments): No quantitative validation, ablation studies, or comparison results against gradient-based methods are described to demonstrate that the four perspectives deliver greater semantic clarity or maintain fidelity to the original model's decisions. The abstract's claim of superior clarity therefore rests on an unverified assumption whose failure would collapse the four-perspective framing.
minor comments (1)
  1. [Abstract] The abstract refers to 'acoustic similarity remains multidimensional' but does not define how the four perspectives are formally distinguished or selected from the representation space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and commit to substantial revisions that will strengthen the technical presentation and empirical support for APEX.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that prototypes can be discovered post-hoc to separately capture transient events, temporal patterns, spectral bands, and joint time-frequency structure—while strictly preserving output invariance and without any fine-tuning—lacks an explicit algorithm, projection step, or invariance proof. Because the backbone latent space may entangle these acoustic dimensions, it is unclear whether the separation is achieved solely on existing representations or requires implicit optimization that would violate the no-fine-tuning guarantee.

    Authors: We acknowledge that the original presentation of the method in §3 could benefit from greater explicitness. APEX operates entirely post-hoc: prototype discovery is performed by optimizing a small set of learnable parameters in the input domain (audio waveforms or spectrograms) while the backbone classifier remains completely frozen. The four perspectives are realized by applying dimension-specific regularization terms and masking operations during this optimization (square masks for transients, temporal slicing for time-based, frequency-band constraints for spectral, and joint 2D constraints for time-frequency). A projection step maps the optimized prototypes back to valid acoustic signals to enforce output invariance. We will add a formal algorithm box, a detailed description of the projection operator, and a short proof that the classifier output is unchanged when the prototype is substituted for the original input. Regarding entanglement, the separation does not require modifying the latent space; it exploits the fact that the pre-trained model already encodes separable acoustic cues, which we isolate via the constrained search rather than through implicit fine-tuning. revision: yes

  2. Referee: [§4] §4 (Experiments): No quantitative validation, ablation studies, or comparison results against gradient-based methods are described to demonstrate that the four perspectives deliver greater semantic clarity or maintain fidelity to the original model's decisions. The abstract's claim of superior clarity therefore rests on an unverified assumption whose failure would collapse the four-perspective framing.

    Authors: We agree that the current experimental section relies primarily on qualitative illustrations and that this is insufficient to substantiate the claims of greater semantic clarity and fidelity. In the revised manuscript we will introduce quantitative evaluations on standard audio classification benchmarks, including: (i) fidelity metrics that measure the change in model output when prototypes are used as explanations, (ii) ablation studies isolating the contribution of each of the four perspectives, and (iii) direct comparisons against gradient-based attribution methods applied to spectrograms using both automated metrics (insertion/deletion AUC) and human-subject ratings of interpretability. These additions will be placed in an expanded §4 and will directly test whether the four-perspective disentanglement improves upon vision-derived baselines while preserving invariance. revision: yes

Circularity Check

0 steps flagged

No circularity: post-hoc framework with no fitted predictions or self-referential reductions

full rationale

The paper presents APEX as a post-hoc, no-fine-tuning method that strictly preserves output invariance while disentangling acoustic similarity into four prototype perspectives. No equations, loss functions, parameter-fitting steps, or self-citations appear in the abstract or described claims that would make any output equivalent to its inputs by construction. The four perspectives are introduced as a design choice for semantic clarity rather than derived from a self-definitional loop or a fitted input renamed as a prediction. The derivation chain remains self-contained and externally falsifiable via the frozen backbone's behavior, with no load-bearing uniqueness theorems or ansatzes smuggled through prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated premise that prototype selection in the four acoustic domains can be performed post-hoc while exactly preserving classifier outputs; no free parameters, axioms, or invented entities are specified in the abstract.

pith-pipeline@v0.9.0 · 5448 in / 948 out tokens · 27691 ms · 2026-05-12T03:02:31.978166+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 3 internal anchors

  1. [1]

    Introduction Neural networks dominate modern audio processing, from deepfake detection to healthcare, often surpassing human per- formance [1, 2]. However, their deployment in safety-critical environments raises significant ethical and legal concerns, par- ticularly under regulations like the AI Act, necessitating robust interpretability. Current explanat...

  2. [2]

    With respect to the training process, methods are typically divided into ante-hoc and post-hoc approaches

    Related work Neural network interpretability methods are commonly cate- gorized along two orthogonal dimensions: the stage at which interpretation is introduced relative to training, and the type of evidence the explanation is designed to represent. With respect to the training process, methods are typically divided into ante-hoc and post-hoc approaches. ...

  3. [3]

    when” and “what

    APEX: Audio Prototype EXplanations for Classification Tasks In this section, we describe the proposed APEX framework (see Fig. 2). APEX is a post-hoc interpretability method de- signed for pre-trained audio classification networks. It provides prototype-based explanations by identifying training samples semantically similar to a given query and highlighti...

  4. [4]

    Experiments We evaluate our approach under two experimental scenarios. In the first scenario, to demonstrate that APEX maintains strict output invariance, we evaluate its classification performance against the vanilla pre-trained ConvNeXt-Base [22] backbone and an AudioProtoPNet trained on the same pre-trained model. In the second scenario, we investigate...

  5. [5]

    APEX goes beyond spectrogram-attribution adaptations and prototype networks trained from scratch by operating on the model’s latent representation with strict output invariance

    Conclusions In this work, we introduced APEX, a post-hoc prototype-based interpretability framework for arbitrary pre-trained audio clas- sifiers. APEX goes beyond spectrogram-attribution adaptations and prototype networks trained from scratch by operating on the model’s latent representation with strict output invariance. We insert a learnable invertible...

  6. [6]

    Human perception of audio deepfakes,

    N. M. Müller, K. Pizzi, and J. Williams, “Human perception of audio deepfakes,” inProceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, ser. DDAM ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 85–91. [Online]. Available: https: //doi.org/10.1145/3552466.3556531

  7. [7]

    Deep learning algorithms to detect murmurs associated with struc- tural heart disease,

    J. Prince, J. Maidens, S. Kieu, C. Currie, D. Barbosa, C. Hitch- cock, A. Saltman, K. Norozi, P. Wiesner, N. Slamonet al., “Deep learning algorithms to detect murmurs associated with struc- tural heart disease,”Journal of the American Heart Association, vol. 12, no. 20, p. e030377, 2023

  8. [8]

    On the reliability of feature attribution methods for speech clas- sification,

    G. Shen, H. Mohebbi, A. Bisazza, A. Alishahi, and G. Chrupala, “On the reliability of feature attribution methods for speech clas- sification,” inInterspeech 2025, 2025, pp. 266–270

  9. [9]

    Audioprotopnet: An interpretable deep learning model for bird sound classifica- tion,

    R. Heinrich, L. Rauch, B. Sick, and C. Scholz, “Audioprotopnet: An interpretable deep learning model for bird sound classifica- tion,”Ecological Informatics, vol. 87, p. 103081, 2025

  10. [10]

    Interpretable all-type audio deepfake detection with audio llms via frequency-time reinforcement learning,

    Y . Xie, X. Guo, J. Zhou, T. Wang, J. Liu, R. Fu, X. Wang, H. Cheng, and L. Ye, “Interpretable all-type audio deepfake detection with audio llms via frequency-time reinforcement learning,” 2026. [Online]. Available: https: //arxiv.org/abs/2601.02983

  11. [11]

    Epic: Explanation of pretrained image classification networks via prototype,

    P. Borycki, M. Tr˛ edowicz, S. Janusz, J. Tabor, P. Spurek, A. Lewicki, and Ł. Struski, “Epic: Explanation of pretrained image classification networks via prototype,”arXiv preprint arXiv:2505.12897, 2025

  12. [12]

    Infodisent: Explainability of image classification models by information disentanglement.arXiv preprint arXiv:2409.10329, 2024

    Ł. Struski, D. Rymarczyk, and J. Tabor, “Infodisent: Explain- ability of image classification models by information disentangle- ment,”arXiv preprint arXiv:2409.10329, 2024

  13. [13]

    Side: Sparse information disentanglement for explainable artificial intelligence.arXiv preprint arXiv:2507.19321, 2025

    V . Dubovik, Łukasz Struski, J. Tabor, and D. Rymarczyk, “Side: Sparse information disentanglement for explainable artificial intelligence,” 2025. [Online]. Available: https://arxiv.org/abs/ 2507.19321

  14. [14]

    C. Chen, O. Li, C. Tao, A. J. Barnett, J. Su, and C. Rudin,This looks like that: deep learning for interpretable image recognition. Red Hook, NY , USA: Curran Associates Inc., 2019

  15. [15]

    Pip-net: Patch-based intuitive prototypes for interpretable image classifi- cation,

    M. Nauta, J. Schlötterer, M. van Keulen, and C. Seifert, “Pip-net: Patch-based intuitive prototypes for interpretable image classifi- cation,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2744–2753

  16. [16]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, Y . Bengio and Y . LeCun, Eds., 2014. [Online]. Available: http://arxiv.org/abs/...

  17. [17]

    Grad-cam: Visual explanations from deep net- works via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep net- works via gradient-based localization,” in2017 IEEE Interna- tional Conference on Computer Vision (ICCV), 2017

  18. [18]

    Grad-cam++: Generalized gradient-based visual expla- nations for deep convolutional networks,

    A. Chattopadhay, A. Sarkar, P. Howlader, and V . N. Balasubra- manian, “Grad-cam++: Generalized gradient-based visual expla- nations for deep convolutional networks,” in2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018

  19. [19]

    On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.PLoS ONE, 10(7):e0130140, July 2015

    S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,”PLOS ONE, vol. 10, no. 7, pp. 1–46, 07 2015. [Online]. Available: https://doi.org/10.1371/journal.pone.0130140

  20. [20]

    Why should I trust you?

    M. T. Ribeiro, S. Singh, and C. Guestrin, “"why should i trust you?": Explaining the predictions of any classifier,” inProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY , USA: Association for Computing Machinery, 2016, p. 1135–1144. [Online]. Available: https: //doi.org/10.1145/2...

  21. [21]

    A unified approach to inter- preting model predictions,

    S. M. Lundberg and S.-I. Lee, “A unified approach to inter- preting model predictions,” inProceedings of the 31st Interna- tional Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY , USA: Curran Associates Inc., 2017, p. 4768–4777

  22. [22]

    Phoneme Discretized Saliency Maps for Explainable Detection of AI- Generated V oice,

    S. Gupta, M. Ravanelli, P. Germain, and C. Subakan, “Phoneme Discretized Saliency Maps for Explainable Detection of AI- Generated V oice,” inInterspeech 2024, 2024, pp. 3295–3299

  23. [23]

    Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,

    J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y . Zhang, Y . Wang, R. Skerrv-Ryan, R. A. Saurous, Y . Agiomvrgiannakis, and Y . Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE Press, 2018, p. 4779–4...

  24. [24]

    Fastspeech 2: Fast and high-quality end-to-end text to speech,

    Y . Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y . Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” inInternational Conference on Learning Representations, 2021. [Online]. Available: https://openreview. net/forum?id=piLPYqxtWuA

  25. [25]

    Toward robust real-world audio deepfake detection: Closing the explainability gap,

    G. Channing, J. Sock, R. Clark, P. Torr, and C. S. de Witt, “Toward robust real-world audio deepfake detection: Closing the explainability gap,” 2024. [Online]. Available: https://arxiv.org/abs/2410.07436

  26. [26]

    audiolime: Lis- tenable explanations using source separation,

    V . Haunschmid, E. Manilow, and G. Widmer, “audiolime: Lis- tenable explanations using source separation,” inProceedings of the 13th International Workshop on Machine Learning and Music (MLM), 2020

  27. [27]

    A convnet for the 2020s,

    Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,”Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2022

  28. [28]

    This sounds like that: Explainable audio classification via prototypical parts,

    A. Fedele, R. Guidotti, and D. Pedreschi, “This sounds like that: Explainable audio classification via prototypical parts,” inInter- national Conference on Discovery Science. Springer, 2024, pp. 348–363

  29. [29]

    And: Audio network dis- section for interpreting deep acoustic models,

    T.-Y . Wu, Y .-X. Lin, and T.-W. Weng, “And: Audio network dis- section for interpreting deep acoustic models,” inProceedings of International Conference on Machine Learning (ICML), 2024

  30. [30]

    A Data- Driven Diffusion-based Approach for Audio Deepfake Explana- tions,

    P. Grinberg, A. Kumar, S. Koppisetti, and G. Bharaj, “A Data- Driven Diffusion-based Approach for Audio Deepfake Explana- tions,” inInterspeech 2025, 2025, pp. 5348–5352

  31. [31]

    Birdset: A large-scale dataset for audio classification in avian bioacoustics,

    L. Rauch, R. Schwinger, M. Wirth, R. Heinrich, D. Huseljic, M. Herde, J. Lange, S. Kahl, B. Sick, S. Tomforde, and C. Scholz, “Birdset: A large-scale dataset for audio classification in avian bioacoustics,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https: //openreview.net/forum?id=dRXxFEY8ZE

  32. [32]

    R. A. Horn and C. R. Johnson,Matrix Analysis. Cambridge University Press, 1985

  33. [33]

    The LJ Speech Dataset,

    K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito. com/LJ-Speech-Dataset/, 2017

  34. [34]

    WaveFake: A Data Set to Facilitate Audio Deepfake Detection,

    J. Frank and L. Schönherr, “WaveFake: A Data Set to Facilitate Audio Deepfake Detection,” in35th Conference on Neural In- formation Processing Systems Datasets and Benchmarks Track, 2021

  35. [35]

    Learning deep features for discriminative localization,

    B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” inPro- ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929

  36. [36]

    Investigation of Sub-Band Discriminative Information Between Spoofed and Genuine Speech,

    K. Sriskandaraja, V . Sethu, P. N. Le, and E. Ambikairajah, “Investigation of Sub-Band Discriminative Information Between Spoofed and Genuine Speech,” inInterspeech 2016, 2016, pp. 1710–1714

  37. [37]

    Audio Replay Attack Detection Using High-Frequency Features,

    M. Witkowski, S. Kacprzak, P. ˙Zelasko, K. Kowalczyk, and J. Gałka, “Audio Replay Attack Detection Using High-Frequency Features,” inInterspeech 2017, 2017, pp. 27–31

  38. [38]

    Environmen- tal sound classification using temporal-frequency attention based convolutional neural network,

    W. Mu, B. Yin, X. Huang, J. Xu, and Z. Du, “Environmen- tal sound classification using temporal-frequency attention based convolutional neural network,”Scientific Reports, vol. 11, no. 1, p. 21552, 2021

  39. [39]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” inInternational Conference on Learning Representations (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1412.6980

  40. [40]

    JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

    R. Sonobe, S. Takamichi, and H. Saruwatari, “Jsut corpus: free large-scale japanese speech corpus for end-to-end speech synthe- sis,”arXiv preprint arXiv:1711.00354, 2017

  41. [41]

    Hifi-gan: Generative adversarial net- works for efficient and high fidelity speech synthesis,

    J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial net- works for efficient and high fidelity speech synthesis,”Advances in Neural Information Processing Systems, vol. 33, 2020

  42. [42]

    Waveglow: A flow-based generative network for speech synthesis,

    R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” inICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3617–3621

  43. [43]

    Mel- gan: Generative adversarial networks for conditional waveform synthesis,

    K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y . Bengio, and A. C. Courville, “Mel- gan: Generative adversarial networks for conditional waveform synthesis,”Advances in neural information processing systems, vol. 32, 2019

  44. [44]

    Multi- band melgan: Faster waveform generation for high-quality text- to-speech,

    G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi- band melgan: Faster waveform generation for high-quality text- to-speech,” in2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 492–498

  45. [45]

    Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,

    R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” inICASSP 2020- 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203

  46. [46]

    wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

  47. [47]

    Hubert: Self-supervised speech represen- tation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdi- nov, and A. Mohamed, “Hubert: Self-supervised speech represen- tation learning by masked prediction of hidden units,”IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021