HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
Pith reviewed 2026-05-10 04:48 UTC · model grok-4.3
The pith
PHOENIX-Mamba detects codec deepfakes in pathological speech at over 93 percent accuracy by clustering modes in hyperbolic space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PHOENIX-Mamba models codec-fakes as multiple self-discovered modes in hyperbolic space, enabling robust discrimination under pathological speech variability. It achieves the strongest performance on the HCFD task across clinical conditions and codecs: 97.04% accuracy on E-Dep, 96.73% on E-Alz, and 96.57% on E-Dys, with 94.41%, 94.40%, and 93.20% on the corresponding Chinese sets.
What carries the argument
PHOENIX-Mamba, a geometry-aware framework that models codec-fakes as multiple self-discovered modes in hyperbolic space to facilitate clustering of heterogeneous fake types.
If this is right
- Existing deepfake detectors must incorporate pathology-specific training data to avoid failure in medical voice applications.
- Hyperbolic geometry offers a structured way to separate synthesis artifacts from disease-induced voice changes without manual mode labels.
- The released Healthcare CodecFake dataset supports systematic benchmarking and iterative improvement of detectors across multiple conditions and codecs.
- Self-discovered clustering in curved space reduces reliance on predefined categories of fakes during model development.
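To make the geometric intuition concrete: hyperbolic models typically embed points in the Poincaré ball, where distances blow up near the boundary. The sketch below is a generic Poincaré-ball distance, not the paper's implementation, and the sample points are illustrative only. It shows why the same Euclidean gap separates points far more strongly near the boundary, which is what lets such embeddings pull fine-grained fake modes apart.

```python
import math

def poincare_dist(u, v):
    """Geodesic distance on the Poincare ball (curvature -1).

    u, v: points strictly inside the unit ball, as coordinate sequences.
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))  # ||u - v||^2
    nu = sum(a * a for a in u)                         # ||u||^2
    nv = sum(b * b for b in v)                         # ||v||^2
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - nu) * (1.0 - nv)))

# The same Euclidean gap (0.1) costs far more near the boundary.
near_origin = poincare_dist((0.0, 0.0), (0.1, 0.0))      # ~0.20
near_boundary = poincare_dist((0.85, 0.0), (0.95, 0.0))  # ~1.15
```

This boundary-expanding behavior is the usual argument for clustering tree-like or multi-modal structure in hyperbolic rather than Euclidean space.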
Where Pith is reading between the lines
- The same hyperbolic clustering technique could be tested on non-codec synthesis methods to see whether it generalizes beyond the current focus.
- Clinical voice monitoring tools might integrate this approach to flag synthetic inputs in telemedicine or diagnostic recordings.
- The interaction between specific pathologies and codec artifacts revealed by the clustering could guide the creation of more targeted synthetic datasets.
Load-bearing premise
The NAC-synthesized pathological speech samples in the released dataset accurately represent the distribution and variability of real-world codec-based deepfakes encountered in clinical healthcare settings.
What would settle it
Performance of PHOENIX-Mamba on a fresh collection of actual clinical recordings containing codec-synthesized pathological speech would drop sharply below the reported levels if the synthetic data does not match real distributions.
Original abstract
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec-based synthetic speech in this work, since neural codec decoding forms a core building block in modern speech generation pipelines. First, we release Healthcare CodecFake, the first pathology-aware dataset containing paired real and NAC-synthesized speech across multiple clinical conditions and codec families. Our evaluations show that SOTA codec-fake detectors trained primarily on healthy speech perform poorly on Healthcare CodecFake, highlighting the need for HCFD-specific models. Second, we demonstrate that PaSST outperforms existing speech-based models for HCFD, benefiting from its patch-based spectro-temporal representation. Finally, we propose PHOENIX-Mamba, a geometry-aware framework that models codec-fakes as multiple self-discovered modes in hyperbolic space and achieves the strongest performance on HCFD across clinical conditions and codecs. Experiments on HCFK show that PHOENIX-Mamba (PaSST) achieves the best overall performance, reaching 97.04 Acc on E-Dep, 96.73 on E-Alz, and 96.57 on E-Dys, while maintaining strong results on Chinese with 94.41 (Dep), 94.40 (Alz), and 93.20 (Dys). This geometry-aware formulation enables self-discovered clustering of heterogeneous codec-fake modes in hyperbolic space, facilitating robust discrimination under pathological speech variability. PHOENIX-Mamba achieves topmost performance on the HCFD task across clinical conditions and codecs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the HCFD task for detecting codec-based deepfakes under pathological speech conditions. It releases the Healthcare CodecFake dataset pairing real pathological speech (depression, Alzheimer's, dysarthria) with NAC-synthesized fakes across English and Chinese, shows that existing SOTA detectors trained on healthy speech perform poorly, and proposes PHOENIX-Mamba (a PaSST-based model using hyperbolic geometry to self-discover multiple codec-fake modes), reporting peak accuracies of 97.04% on E-Dep, 96.73% on E-Alz, and 96.57% on E-Dys.
Significance. The dataset release addresses a genuine gap in pathology-aware deepfake benchmarks and could enable more robust healthcare audio security tools if the representativeness claim holds. The hyperbolic formulation offers a plausible mechanism for handling heterogeneous fake modes, but its added value over standard spectro-temporal models remains to be demonstrated beyond the new benchmark.
Major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The reported accuracies (97.04 Acc on E-Dep etc.) and superiority claims are presented without any description of training procedures, baseline re-implementations, data splits, cross-validation, number of runs, or statistical tests; this directly undermines the ability to verify that the numbers support the central claim of SOTA performance on HCFD.
- [§3] §3 (Dataset): The core premise that NAC-synthesized pathological samples accurately represent real-world clinical codec deepfakes is stated without supporting evidence such as perceptual validation by clinicians, spectral/temporal distribution comparisons, or tests against actual deployed healthcare synthesis pipelines; if NAC artifacts differ systematically under dysarthria or other conditions, the benchmark results and the claimed clinical relevance become dataset-specific.
- [§5] §5 (PHOENIX-Mamba): The claim that the hyperbolic geometry enables 'self-discovered clustering of heterogeneous codec-fake modes' is load-bearing for the proposed method's novelty, yet no ablation removing the hyperbolic component, no quantitative clustering metrics (e.g., silhouette score), and no visualizations of the embedding space are referenced to isolate its contribution over plain PaSST.
Minor comments (2)
- [Abstract] Abstract: 'Experiments on HCFK' appears to be a typo for the dataset name Healthcare CodecFake or the HCFD task.
- [Abstract] Notation: The distinction between 'E-Dep/E-Alz/E-Dys' (English) and the Chinese conditions is not explicitly defined on first use, complicating readability of the performance table.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback. We will revise the manuscript to address the major concerns by providing missing experimental details, additional validation for the dataset, and ablations for the proposed method. Our point-by-point responses are as follows.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The reported accuracies (97.04 Acc on E-Dep etc.) and superiority claims are presented without any description of training procedures, baseline re-implementations, data splits, cross-validation, number of runs, or statistical tests; this directly undermines the ability to verify that the numbers support the central claim of SOTA performance on HCFD.
Authors: We agree that the experimental setup details were not sufficiently elaborated in the manuscript. In the revised version, we will expand §4 to include full descriptions of: training procedures and hyperparameters for all models; baseline re-implementations including any modifications; data splits with speaker independence considerations; cross-validation approach; results averaged over multiple runs including variability; and statistical tests to support the superiority claims. These additions will enable verification of the SOTA performance on HCFD. revision: yes
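One lightweight way to carry out the promised statistical tests (a generic sketch, not the authors' protocol; the accuracy values below are placeholders) is a paired sign-flip permutation test over per-run accuracies of two systems evaluated on matched runs or splits:

```python
import random

def paired_permutation_test(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-run accuracies.

    Returns an estimated p-value for the null hypothesis that the two
    systems have equal mean accuracy across matched runs/splits.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing
```

Note that with only five matched runs the smallest attainable two-sided p-value is 2/2^5 ≈ 0.06, so supporting strong superiority claims would require enough runs or folds to make the test meaningfully powered.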
-
Referee: [§3] §3 (Dataset): The core premise that NAC-synthesized pathological samples accurately represent real-world clinical codec deepfakes is stated without supporting evidence such as perceptual validation by clinicians, spectral/temporal distribution comparisons, or tests against actual deployed healthcare synthesis pipelines; if NAC artifacts differ systematically under dysarthria or other conditions, the benchmark results and the claimed clinical relevance become dataset-specific.
Authors: We acknowledge this as a valid concern. NAC is used as it is a foundational component in contemporary speech generation pipelines. In the revision, we will incorporate spectral and temporal distribution comparisons between real and NAC-synthesized samples. We will also add a limitations section explicitly stating that perceptual validation by clinicians and tests on deployed clinical pipelines were not performed in this work, positioning the dataset as an initial benchmark for pathology-aware detection rather than a definitive real-world proxy. This will clarify the scope and clinical relevance of the results. revision: partial
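As one concrete form the promised spectral comparison could take (a sketch; the choice of feature and binning is an assumption, not the paper's protocol), the real and NAC-synthesized sets could each be summarized by a normalized long-term average spectrum and compared with the Jensen-Shannon divergence:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two
    non-negative envelopes, e.g. long-term average spectra of the real and
    synthesized sets. Inputs are normalized to distributions internally."""
    sp, sq = float(sum(p)), float(sum(q))
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL(a || b) in bits; terms with a_i = 0 contribute nothing,
        # and b_i > 0 wherever a_i > 0 because b is the mixture m.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A divergence near 0 would indicate closely matched spectral distributions; a value near 1 would indicate nearly disjoint support, i.e. strong codec artifacts.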
-
Referee: [§5] §5 (PHOENIX-Mamba): The claim that the hyperbolic geometry enables 'self-discovered clustering of heterogeneous codec-fake modes' is load-bearing for the proposed method's novelty, yet no ablation removing the hyperbolic component, no quantitative clustering metrics (e.g., silhouette score), and no visualizations of the embedding space are referenced to isolate its contribution over plain PaSST.
Authors: We agree that further evidence is required to support the novelty of the hyperbolic component. In the revised §5, we will add an ablation study removing the hyperbolic geometry to quantify its impact on performance across conditions. Additionally, we will provide quantitative clustering metrics (such as silhouette scores) and visualizations of the embedding spaces to demonstrate the self-discovered modes and their separation in hyperbolic space compared to Euclidean alternatives. revision: yes
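For reference, the silhouette coefficient the revision promises can be stated compactly; the pure-Python sketch below (a published paper would more likely use a library implementation such as scikit-learn's) scores each embedding by comparing its mean within-cluster distance a against its mean distance b to the nearest other cluster:

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    points: list of coordinate tuples; labels: parallel list of cluster ids.
    Values near +1 mean tight, well-separated clusters; values below 0
    mean points sit closer to another cluster than to their own.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    uniq = sorted(set(labels))
    idx = {l: [i for i in range(n) if labels[i] == l] for l in uniq}
    scores = []
    for i in range(n):
        own = idx[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in idx[l]) / len(idx[l])
            for l in uniq if l != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Reporting this metric for the self-discovered modes, in both the hyperbolic embedding and a Euclidean ablation, would directly quantify the separation the method claims (for the hyperbolic case, the Euclidean `dist` above would be swapped for the geodesic distance).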
Circularity Check
No significant circularity; empirical benchmark results on newly introduced dataset
Full rationale
The paper introduces a new task (HCFD) and releases a new paired dataset (Healthcare CodecFake) of real pathological speech and NAC-synthesized fakes. It then trains and evaluates models (including the proposed PHOENIX-Mamba) on splits of this dataset, reporting accuracies such as 97.04 on E-Dep. This is standard benchmark practice and does not reduce any claimed result to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. No equations are presented that equate the performance metric to its own inputs by construction. The geometry-aware hyperbolic clustering is described as an empirical modeling choice whose benefit is demonstrated on the data, without being tautological. The paper is self-contained against its own external benchmark and contains independent content in dataset construction and model design.
Axiom & Free-Parameter Ledger
Invented entities (1)
- PHOENIX-Mamba: no independent evidence