CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech

Haibin Wu; Hung-yi Lee; I-Hsiang Chiu; I-Ming Lin; Jiawei Du; Jyh-Shing Roger Jang; Lin Zhang; Wenze Ren; Xuanjun Chen; Yuan Tseng

arxiv: 2501.08238 · v3 · pith:NKWIXP3Ynew · submitted 2025-01-14 · 💻 cs.SD · eess.AS

Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

keywords: codecfake, speech, data, detection, codec, cosg, models, taxonomy

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.
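The taxonomy-based data selection described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the codec names, the quantizer labels, and the `select_codecs` helper are hypothetical, and the "time" decoder label is only an assumed contrast to the frequency-domain decoder the abstract mentions. The sketch tags each codec along the three taxonomy axes and keeps those whose re-synthesized speech the abstract reports as most useful for training (disentanglement auxiliary objectives or a frequency-domain decoder).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecTag:
    """Taxonomy labels for one codec (all values here are hypothetical)."""
    name: str
    quantizer: str   # vector quantizer type (placeholder labels)
    auxiliary: str   # auxiliary objective, e.g. "disentanglement" or "none"
    decoder: str     # decoder type, e.g. "frequency" or "time" (assumed label)


# Hypothetical registry; the real dataset uses 31 open-source codecs.
CODECS = [
    CodecTag("codec_a", "vq_type_1", "disentanglement", "time"),
    CodecTag("codec_b", "vq_type_1", "none", "frequency"),
    CodecTag("codec_c", "vq_type_2", "none", "time"),
]


def select_codecs(codecs, auxiliary=None, decoder=None):
    """Return names of codecs matching EITHER requested attribute.

    An attribute left as None is ignored; if both are None, nothing matches.
    """
    selected = []
    for codec in codecs:
        matches_aux = auxiliary is not None and codec.auxiliary == auxiliary
        matches_dec = decoder is not None and codec.decoder == decoder
        if matches_aux or matches_dec:
            selected.append(codec.name)
    return selected


# Keep codecs with a disentanglement objective or a frequency-domain decoder.
selected = select_codecs(CODECS, auxiliary="disentanglement", decoder="frequency")
print(selected)
```

In practice the selected names would be used to filter which CoRS utterances enter the training set; how the authors actually perform that selection is not specified in the abstract.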

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech
cs.SD 2026-06 unverdicted novelty 7.0

Introduces DSFA to turn deterministic audio features stochastic during fine-tuning and the CoSG ExtEval dataset, claiming SOTA generalization for CodecFake detection.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS 2026-04 unverdicted novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
cs.SD 2026-03 unverdicted novelty 7.0

MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.

Bridging the Age Gap: Towards Detecting Neural Audio Codec Synthesized Elderly Speech Deepfake
eess.AS 2026-06 unverdicted novelty 6.0

Defines ECFD task, releases ECF dataset, demonstrates poor generalization of prior detectors to elderly speech, and introduces BONSAI fusion of LanguageBind and ImageBind achieving 1.66% average EER.

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
eess.AS 2026-04 unverdicted novelty 6.0

HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.

From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning
eess.AS 2026-07 unverdicted novelty 3.0

A survey that organizes audio SSL into five objective paradigms, relates their demands to architectural biases, and interprets downstream applications as tests of generalization.

On The Landscape of Spoken Language Models: A Comprehensive Survey
cs.CL 2025-04 unverdicted novelty 3.0

A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.