HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
Pith reviewed 2026-05-10 04:48 UTC · model grok-4.3
The pith
PHOENIX-Mamba detects codec deepfakes in pathological speech at over 93 percent accuracy by clustering modes in hyperbolic space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PHOENIX-Mamba models codec-fakes as multiple self-discovered modes in hyperbolic space, enabling robust discrimination under pathological speech variability. It achieves the strongest performance on the HCFD task across clinical conditions and codecs: 97.04% accuracy on E-Dep, 96.73% on E-Alz, and 96.57% on E-Dys, with 94.41%, 94.40%, and 93.20% on the corresponding Chinese sets.
What carries the argument
PHOENIX-Mamba, a geometry-aware framework that models codec-fakes as multiple self-discovered modes in hyperbolic space to facilitate clustering of heterogeneous fake types.
If this is right
- Existing deepfake detectors must incorporate pathology-specific training data to avoid failure in medical voice applications.
- Hyperbolic geometry offers a structured way to separate synthesis artifacts from disease-induced voice changes without manual mode labels.
- The released Healthcare CodecFake dataset supports systematic benchmarking and iterative improvement of detectors across multiple conditions and codecs.
- Self-discovered clustering in curved space reduces reliance on predefined categories of fakes during model development.
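To make the geometric intuition concrete: hyperbolic models typically embed points in the Poincaré ball, where distances blow up near the boundary. The sketch below is a generic Poincaré-ball distance, not the paper's implementation, and the sample points are illustrative only. It shows why the same Euclidean gap separates points far more strongly near the boundary, which is what lets such embeddings pull fine-grained fake modes apart.

```python
import math

def poincare_dist(u, v):
    """Geodesic distance on the Poincare ball (curvature -1).

    u, v: points strictly inside the unit ball, as coordinate sequences.
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))  # ||u - v||^2
    nu = sum(a * a for a in u)                         # ||u||^2
    nv = sum(b * b for b in v)                         # ||v||^2
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - nu) * (1.0 - nv)))

# The same Euclidean gap (0.1) costs far more near the boundary.
near_origin = poincare_dist((0.0, 0.0), (0.1, 0.0))      # ~0.20
near_boundary = poincare_dist((0.85, 0.0), (0.95, 0.0))  # ~1.15
```

This boundary-expanding behavior is the usual argument for clustering tree-like or multi-modal structure in hyperbolic rather than Euclidean space.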
Where Pith is reading between the lines
- The same hyperbolic clustering technique could be tested on non-codec synthesis methods to see whether it generalizes beyond the current focus.
- Clinical voice monitoring tools might integrate this approach to flag synthetic inputs in telemedicine or diagnostic recordings.
- The interaction between specific pathologies and codec artifacts revealed by the clustering could guide the creation of more targeted synthetic datasets.
Load-bearing premise
The NAC-synthesized pathological speech samples in the released dataset accurately represent the distribution and variability of real-world codec-based deepfakes encountered in clinical healthcare settings.
What would settle it
Performance of PHOENIX-Mamba on a fresh collection of actual clinical recordings containing codec-synthesized pathological speech would drop sharply below the reported levels if the synthetic data does not match real distributions.
Original abstract
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec-based synthetic speech in this work, since neural codec decoding forms a core building block in modern speech generation pipelines. First, we release Healthcare CodecFake, the first pathology-aware dataset containing paired real and NAC-synthesized speech across multiple clinical conditions and codec families. Our evaluations show that SOTA codec-fake detectors trained primarily on healthy speech perform poorly on Healthcare CodecFake, highlighting the need for HCFD-specific models. Second, we demonstrate that PaSST outperforms existing speech-based models for HCFD, benefiting from its patch-based spectro-temporal representation. Finally, we propose PHOENIX-Mamba, a geometry-aware framework that models codec-fakes as multiple self-discovered modes in hyperbolic space and achieves the strongest performance on HCFD across clinical conditions and codecs. Experiments on HCFK show that PHOENIX-Mamba (PaSST) achieves the best overall performance, reaching 97.04 Acc on E-Dep, 96.73 on E-Alz, and 96.57 on E-Dys, while maintaining strong results on Chinese with 94.41 (Dep), 94.40 (Alz), and 93.20 (Dys). This geometry-aware formulation enables self-discovered clustering of heterogeneous codec-fake modes in hyperbolic space, facilitating robust discrimination under pathological speech variability. PHOENIX-Mamba achieves topmost performance on the HCFD task across clinical conditions and codecs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the HCFD task for detecting codec-based deepfakes under pathological speech conditions. It releases the Healthcare CodecFake dataset pairing real pathological speech (depression, Alzheimer's, dysarthria) with NAC-synthesized fakes across English and Chinese, shows that existing SOTA detectors trained on healthy speech perform poorly, and proposes PHOENIX-Mamba (a PaSST-based model using hyperbolic geometry to self-discover multiple codec-fake modes), reporting peak accuracies of 97.04% on E-Dep, 96.73% on E-Alz, and 96.57% on E-Dys.
Significance. The dataset release addresses a genuine gap in pathology-aware deepfake benchmarks and could enable more robust healthcare audio security tools if the representativeness claim holds. The hyperbolic formulation offers a plausible mechanism for handling heterogeneous fake modes, but its added value over standard spectro-temporal models remains to be demonstrated beyond the new benchmark.
Major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The reported accuracies (97.04 Acc on E-Dep etc.) and superiority claims are presented without any description of training procedures, baseline re-implementations, data splits, cross-validation, number of runs, or statistical tests; this directly undermines the ability to verify that the numbers support the central claim of SOTA performance on HCFD.
- [§3] §3 (Dataset): The core premise that NAC-synthesized pathological samples accurately represent real-world clinical codec deepfakes is stated without supporting evidence such as perceptual validation by clinicians, spectral/temporal distribution comparisons, or tests against actual deployed healthcare synthesis pipelines; if NAC artifacts differ systematically under dysarthria or other conditions, the benchmark results and the claimed clinical relevance become dataset-specific.
- [§5] §5 (PHOENIX-Mamba): The claim that the hyperbolic geometry enables 'self-discovered clustering of heterogeneous codec-fake modes' is load-bearing for the proposed method's novelty, yet no ablation removing the hyperbolic component, no quantitative clustering metrics (e.g., silhouette score), and no visualizations of the embedding space are referenced to isolate its contribution over plain PaSST.
Minor comments (2)
- [Abstract] Abstract: 'Experiments on HCFK' appears to be a typo for the dataset name Healthcare CodecFake or the HCFD task.
- [Abstract] Notation: The distinction between 'E-Dep/E-Alz/E-Dys' (English) and the Chinese conditions is not explicitly defined on first use, complicating readability of the performance table.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback. We will revise the manuscript to address the major concerns by providing missing experimental details, additional validation for the dataset, and ablations for the proposed method. Our point-by-point responses are as follows.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The reported accuracies (97.04 Acc on E-Dep etc.) and superiority claims are presented without any description of training procedures, baseline re-implementations, data splits, cross-validation, number of runs, or statistical tests; this directly undermines the ability to verify that the numbers support the central claim of SOTA performance on HCFD.
Authors: We agree that the experimental setup details were not sufficiently elaborated in the manuscript. In the revised version, we will expand §4 to include full descriptions of: training procedures and hyperparameters for all models; baseline re-implementations including any modifications; data splits with speaker independence considerations; cross-validation approach; results averaged over multiple runs including variability; and statistical tests to support the superiority claims. These additions will enable verification of the SOTA performance on HCFD. revision: yes
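One lightweight way to carry out the promised statistical tests (a generic sketch, not the authors' protocol; the accuracy values below are placeholders) is a paired sign-flip permutation test over per-run accuracies of two systems evaluated on matched runs or splits:

```python
import random

def paired_permutation_test(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-run accuracies.

    Returns an estimated p-value for the null hypothesis that the two
    systems have equal mean accuracy across matched runs/splits.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing
```

Note that with only five matched runs the smallest attainable two-sided p-value is 2/2^5 ≈ 0.06, so supporting strong superiority claims would require enough runs or folds to make the test meaningfully powered.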
-
Referee: [§3] §3 (Dataset): The core premise that NAC-synthesized pathological samples accurately represent real-world clinical codec deepfakes is stated without supporting evidence such as perceptual validation by clinicians, spectral/temporal distribution comparisons, or tests against actual deployed healthcare synthesis pipelines; if NAC artifacts differ systematically under dysarthria or other conditions, the benchmark results and the claimed clinical relevance become dataset-specific.
Authors: We acknowledge this as a valid concern. NAC is used as it is a foundational component in contemporary speech generation pipelines. In the revision, we will incorporate spectral and temporal distribution comparisons between real and NAC-synthesized samples. We will also add a limitations section explicitly stating that perceptual validation by clinicians and tests on deployed clinical pipelines were not performed in this work, positioning the dataset as an initial benchmark for pathology-aware detection rather than a definitive real-world proxy. This will clarify the scope and clinical relevance of the results. revision: partial
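As one concrete form the promised spectral comparison could take (a sketch; the choice of feature and binning is an assumption, not the paper's protocol), the real and NAC-synthesized sets could each be summarized by a normalized long-term average spectrum and compared with the Jensen-Shannon divergence:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two
    non-negative envelopes, e.g. long-term average spectra of the real and
    synthesized sets. Inputs are normalized to distributions internally."""
    sp, sq = float(sum(p)), float(sum(q))
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL(a || b) in bits; terms with a_i = 0 contribute nothing,
        # and b_i > 0 wherever a_i > 0 because b is the mixture m.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A divergence near 0 would indicate closely matched spectral distributions; a value near 1 would indicate nearly disjoint support, i.e. strong codec artifacts.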
-
Referee: [§5] §5 (PHOENIX-Mamba): The claim that the hyperbolic geometry enables 'self-discovered clustering of heterogeneous codec-fake modes' is load-bearing for the proposed method's novelty, yet no ablation removing the hyperbolic component, no quantitative clustering metrics (e.g., silhouette score), and no visualizations of the embedding space are referenced to isolate its contribution over plain PaSST.
Authors: We agree that further evidence is required to support the novelty of the hyperbolic component. In the revised §5, we will add an ablation study removing the hyperbolic geometry to quantify its impact on performance across conditions. Additionally, we will provide quantitative clustering metrics (such as silhouette scores) and visualizations of the embedding spaces to demonstrate the self-discovered modes and their separation in hyperbolic space compared to Euclidean alternatives. revision: yes
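For reference, the silhouette coefficient the revision promises can be stated compactly; the pure-Python sketch below (a published paper would more likely use a library implementation such as scikit-learn's) scores each embedding by comparing its mean within-cluster distance a against its mean distance b to the nearest other cluster:

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    points: list of coordinate tuples; labels: parallel list of cluster ids.
    Values near +1 mean tight, well-separated clusters; values below 0
    mean points sit closer to another cluster than to their own.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    uniq = sorted(set(labels))
    idx = {l: [i for i in range(n) if labels[i] == l] for l in uniq}
    scores = []
    for i in range(n):
        own = idx[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in idx[l]) / len(idx[l])
            for l in uniq if l != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Reporting this metric for the self-discovered modes, in both the hyperbolic embedding and a Euclidean ablation, would directly quantify the separation the method claims (for the hyperbolic case, the Euclidean `dist` above would be swapped for the geodesic distance).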
Circularity Check
No significant circularity; empirical benchmark results on newly introduced dataset
Full rationale
The paper introduces a new task (HCFD) and releases a new paired dataset (Healthcare CodecFake) of real pathological speech and NAC-synthesized fakes. It then trains and evaluates models (including the proposed PHOENIX-Mamba) on splits of this dataset, reporting accuracies such as 97.04 on E-Dep. This is standard benchmark practice and does not reduce any claimed result to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. No equations are presented that equate the performance metric to its own inputs by construction. The geometry-aware hyperbolic clustering is described as an empirical modeling choice whose benefit is demonstrated on the data, without being tautological. The paper is self-contained against its own external benchmark and contains independent content in dataset construction and model design.
Axiom & Free-Parameter Ledger
Invented entities (1)
- PHOENIX-Mamba: no independent evidence