RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

Hieu-Thi Luong; Ivan Kukanov; Kong Aik Lee; Xuechen Liu; Zheng Xin Chai

arxiv: 2605.09568 · v2 · pith:FI2SAV6Anew · submitted 2026-05-10 · 📡 eess.AS

Pith reviewed 2026-05-20 22:53 UTC · model grok-4.3

classification 📡 eess.AS

keywords: audio deepfake, media transformations, multilingual detection, robustness, equal error rate, challenge dataset, fake audio recognition

The pith

The RADAR Challenge 2026 shows that audio deepfake detectors remain unreliable under media transformations and across multiple languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up the RADAR Challenge to evaluate how well audio deepfake recognition systems handle realistic conditions. It provides labeled English data for development and a large multilingual evaluation set with more than 100,000 utterances in six languages. The audio undergoes transformations such as compression, resampling, noise addition, and reverberation to simulate distribution pipelines. Systems are scored by equal error rate in distinguishing real from fake audio. Submissions from 22 teams in the evaluation phase demonstrate ongoing difficulties in achieving robust performance.

Core claim

The authors construct a two-phase challenge with a multilingual dataset under media transformations and report evaluation results indicating that current deepfake detection approaches struggle to maintain low error rates when audio is altered by common media processing or presented in diverse languages.

What carries the argument

The challenge's dataset construction and evaluation protocol that applies compression, resampling, noise, and reverberation to audio samples from multiple languages for binary classification measured by equal error rate.
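
To make the transformation step concrete, here is a minimal sketch of three of the four degradations named above (reverberation, additive noise, resampling), written with the Python standard library only. It is not the organizers' pipeline: the function names, the synthetic impulse response, the white-noise source, the 15 dB SNR, the 16 kHz to 8 kHz rate change, and the order of operations are all assumptions chosen for illustration. Compression is left out because it needs an external codec.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the result has the requested SNR in dB."""
    sig_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    target_power = sig_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def synthetic_rir(length, decay, rng):
    """Toy room impulse response: exponentially decaying noise with a unit direct path.
    A hypothetical stand-in for a measured or simulated impulse response."""
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * k) for k in range(length)]
    rir[0] = 1.0
    return rir


def reverberate(signal, rir):
    """Convolve the signal with an impulse response, truncated to the input length."""
    out = [0.0] * len(signal)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            if i + j >= len(signal):
                break
            out[i + j] += x * h
    return out


def resample_linear(signal, src_rate, dst_rate):
    """Naive linear-interpolation resampling (no anti-aliasing filter)."""
    n_out = int(len(signal) * dst_rate / src_rate)
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1.0 - frac) * signal[i] + frac * nxt)
    return out


def degrade(signal, rng, snr_db=15.0, src_rate=16000, dst_rate=8000):
    """Chain the toy transformations; the order and all defaults are assumptions."""
    wet = reverberate(signal, synthetic_rir(64, 0.1, rng))
    noisy = add_noise(wet, snr_db, rng)
    return resample_linear(noisy, src_rate, dst_rate)


if __name__ == "__main__":
    rng = random.Random(0)
    tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    print(len(degrade(tone, rng)))
```

A production pipeline would more plausibly use measured impulse responses, recorded noise, a band-limited resampler, and real codecs; the sketch only shows where each degradation enters the signal path.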

If this is right

Detectors must be designed to tolerate common audio processing steps to succeed in real applications.
Multilingual capabilities are essential for detectors to work across different linguistic contexts.
Future work should focus on improving generalization to unseen transformations and languages.
Challenges like this can serve as standardized tests to track progress in audio authenticity verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use this benchmark to test new methods that explicitly model transformation effects.
Similar challenge structures might help evaluate deepfake detection in other media types such as video or images.
The results imply that current systems may overfit to clean, single-language training data.

Load-bearing premise

The selected media transformations and the way the dataset is built accurately reflect the conditions audio encounters when distributed in real-world pipelines.

What would settle it

A submitted system achieving an equal error rate near zero on the full multilingual, transformed evaluation set would contradict the claim that challenges remain and would suggest that robust detection is already possible.

Figures

Figures reproduced from arXiv: 2605.09568 by Hieu-Thi Luong, Ivan Kukanov, Kong Aik Lee, Xuechen Liu, Zheng Xin Chai.

Original abstract

RADAR Challenge 2026 is an APSIPA Grand Challenge on Robust Audio Deepfake Recognition under Media Transformations, designed to simulate realistic media conditions in real-world audio distribution pipelines, including compression, resampling, noise, and reverberation. It consists of two phases: an English development phase with labeled data for analysis and paper writing, and a multilingual evaluation phase containing more than 100,000 utterances in English, Singapore English, Mandarin Chinese, Taiwanese Mandarin, Japanese, and Vietnamese. Systems are evaluated using equal error rate (EER) for binary real/fake classification. This paper describes the challenge task, the construction of the data set, the evaluation protocol, and the overall results. During the challenge, 33 teams submitted to the development phase and 22 teams submitted to the final evaluation phase. The reported results highlight the remaining challenges of robust audio deepfake detection under multilingual and media-transformed conditions.
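
Since the abstract names equal error rate as the sole metric, a short sketch of how EER can be computed from detector scores may help. This is a generic threshold sweep, not the challenge's official scoring script. The function name `compute_eer`, the convention that higher scores mean "more likely fake", and the label encoding (1 = fake, 0 = real) are assumptions made for this example.

```python
def compute_eer(scores, labels):
    """Return (eer, threshold) for binary real/fake scores.

    Assumed convention: higher score = more likely fake; label 1 = fake, 0 = real.
    The EER is taken at the threshold where the false-alarm rate (real flagged
    as fake) and the miss rate (fake not flagged) are closest, and is reported
    as their average at that point.
    """
    n_fake = sum(1 for y in labels if y == 1)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one real and one fake trial")

    # Sweep the threshold from the highest score downward.
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    best_gap, eer, threshold = float("inf"), None, None
    flagged_fake = flagged_real = 0
    i = 0
    while i < len(trials):
        s = trials[i][0]
        # Trials with identical scores cross the threshold together.
        while i < len(trials) and trials[i][0] == s:
            if trials[i][1] == 1:
                flagged_fake += 1
            else:
                flagged_real += 1
            i += 1
        false_alarm = flagged_real / n_real
        miss = (n_fake - flagged_fake) / n_fake
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, eer, threshold = gap, (false_alarm + miss) / 2, s
    return eer, threshold


if __name__ == "__main__":
    eer, thr = compute_eer([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
    print(f"EER = {eer:.3f} at threshold {thr}")
```

A per-language or per-transformation breakdown could be obtained by calling the same function on the corresponding subset of trials, provided each subset contains both real and fake utterances.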

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a standard challenge setup paper that provides a new multilingual benchmark for audio deepfake detection under media transformations but offers no novel methods or deep analysis.

The main point is that this paper organizes the RADAR Challenge 2026, complete with a new dataset of over 100,000 utterances in English, Singapore English, Mandarin, Taiwanese Mandarin, Japanese, and Vietnamese, plus media transformations like compression, resampling, noise, and reverberation. It splits into an English development phase and a multilingual evaluation phase, using equal error rate for real-versus-fake classification. Thirty-three teams entered development and twenty-two reached the final evaluation, with the results underscoring that current systems still struggle under these conditions.

Referee Report

1 major / 2 minor

Summary. The manuscript describes the RADAR Challenge 2026, an APSIPA Grand Challenge on robust audio deepfake recognition under media transformations. It outlines the two-phase structure (English development phase with labeled data and multilingual evaluation phase with >100,000 utterances across English, Singapore English, Mandarin Chinese, Taiwanese Mandarin, Japanese, and Vietnamese), the evaluation protocol using equal error rate (EER) for real/fake binary classification, participation numbers (33 development and 22 evaluation submissions), and concludes that the results highlight remaining challenges under multilingual and media-transformed conditions.

Significance. If the dataset construction and transformations are accepted as a reasonable proxy for real-world conditions, the work is significant for establishing a community benchmark that addresses gaps in multilingual and transformed audio deepfake detection. The high participation and explicit focus on realistic media pipelines (compression, resampling, noise, reverberation) can stimulate targeted research advances. The descriptive documentation of task, data, and protocol provides a reusable reference point for the field.

major comments (1)

Abstract and dataset construction section: the assertion that the chosen transformations (compression, resampling, noise, reverberation) 'simulate realistic media conditions in real-world audio distribution pipelines' is presented without citations to empirical studies or quantitative validation of the specific parameter ranges, which is load-bearing for interpreting the reported EER outcomes as evidence of real-world robustness challenges.

minor comments (2)

The manuscript would benefit from a table summarizing the exact media transformation parameters applied to the evaluation set to improve reproducibility.
Ensure consistent use of language names (e.g., 'Singapore English' vs. 'Singlish') across sections and the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the single major comment below.

Referee: [—] Abstract and dataset construction section: the assertion that the chosen transformations (compression, resampling, noise, reverberation) 'simulate realistic media conditions in real-world audio distribution pipelines' is presented without citations to empirical studies or quantitative validation of the specific parameter ranges, which is load-bearing for interpreting the reported EER outcomes as evidence of real-world robustness challenges.

Authors: We agree that the manuscript would benefit from explicit citations and justification for the transformation parameters. While the chosen degradations (compression, resampling, additive noise, and reverberation) reflect standard operations in real-world audio pipelines such as social-media upload, VoIP, and broadcast, we did not include supporting references in the initial submission. In the revised manuscript we will add citations to relevant studies on audio degradation in media distribution and provide a short rationale for the selected parameter ranges drawn from common practice in the audio-forensics literature.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely descriptive account of an APSIPA Grand Challenge setup. It defines the task, constructs a dataset with specified media transformations, states the EER evaluation protocol for binary classification, and reports aggregated results from 33 development and 22 evaluation submissions by external teams. No derivations, equations, predictions, or self-referential claims appear. The central statement, that the results highlight remaining challenges, follows directly from the participation numbers and observed performance and does not reduce to a fitted input or to a self-citation by construction. The multilingual and transformation conditions are presented as the challenge definition itself rather than as derived outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work consists of challenge organization and reporting of participant performance on a constructed dataset.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Paper passage: "Systems are evaluated using equal error rate (EER) for binary real/fake classification... 33 teams submitted to the development phase and 22 teams submitted to the final evaluation phase."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear

Paper passage: "The reported results highlight the remaining challenges of robust audio deepfake detection under multilingual and media-transformed conditions."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

