Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track
Pith reviewed 2026-05-25 02:17 UTC · model grok-4.3
The pith
F5-TTS fine-tuned with EMA and dual LLM-LALM prompt scoring yields speech that ranks hardest to detect on three SASV systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
F5-TTS-DPS integrates EMA into supervised fine-tuning to stabilize training and improve generalization; it further applies dual-scoring prompt selection with LLMs and LALMs to filter reference audio and text prompts, thereby addressing alignment problems in noisy datasets and producing output that achieves the best a-DCF scores of 0.1582, 0.5233 and 0.2562 on three advanced SASV systems together with competitive WER.
What carries the argument
Dual-scoring prompt selection that combines LLM and LALM scores to filter noisy in-the-wild reference prompts, used together with EMA during supervised fine-tuning of the F5-TTS base model.
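One way to picture dual-scoring selection is as a two-threshold filter followed by a ranking. The sketch below is illustrative only: the report does not say how the LLM and LALM scores are combined, so the 0-10 scale, the thresholds, the weighted sum, and every name here (`PromptCandidate`, `select_prompt`, `audio_weight`) are assumptions rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCandidate:
    """A reference prompt with two hypothetical scores on an assumed 0-10 scale."""
    prompt_id: str
    audio_score: float  # assumed: a LALM's rating of the reference audio
    text_score: float   # assumed: an LLM's rating of fit to the target text


def select_prompt(candidates, min_audio=6.0, min_text=6.0, audio_weight=0.5):
    """Drop candidates failing either threshold, then pick the best weighted sum.

    Returns None when every candidate is filtered out.
    """
    kept = [c for c in candidates
            if c.audio_score >= min_audio and c.text_score >= min_text]
    if not kept:
        return None
    return max(
        kept,
        key=lambda c: audio_weight * c.audio_score
        + (1.0 - audio_weight) * c.text_score,
    )


# Made-up scores, for illustration only.
candidates = [
    PromptCandidate("clip_a", audio_score=9.0, text_score=4.0),  # fails text threshold
    PromptCandidate("clip_b", audio_score=7.0, text_score=8.0),
    PromptCandidate("clip_c", audio_score=8.0, text_score=6.5),
]
best = select_prompt(candidates)
print(best.prompt_id if best else "no usable prompt")
```

Filtering before ranking means a clip with excellent audio but a poor text match is never chosen, which is the behaviour the "dual" in dual-scoring suggests.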
If this is right
- The synthesized speech exhibits the highest degree of naturalness and authenticity among submissions, as measured by the best a-DCF on all three SASV systems.
- EMA during fine-tuning stabilizes training and yields better generalization on in-the-wild data.
- Dual-scoring prompt selection improves fidelity while maintaining competitive word-error-rate performance.
- The combination produces audio with UTMOS 3.20 and speaker similarity 0.51 on the development set.
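The EMA item above refers to keeping a slowly moving average of the model weights alongside the weights being trained. A minimal sketch of the standard update follows, using plain Python lists in place of real tensors; the decay values and the function name `ema_update` are illustrative, and the report does not state the decay it used.

```python
def ema_update(ema_params, model_params, decay=0.999):
    """Return new EMA weights: decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]


# Toy run: the raw weight jumps between 0 and 2, the EMA copy swings less.
raw_trajectory = [0.0, 2.0, 0.0, 2.0, 0.0, 2.0]
ema = [raw_trajectory[0]]
for w in raw_trajectory[1:]:
    ema = ema_update(ema, [w], decay=0.5)
print(ema)
```

In practice the averaged copy, not the raw weights, is typically the one used for inference.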
Where Pith is reading between the lines
- Prompt-quality filtering may prove more decisive than architecture changes when training TTS on noisy real-world corpora.
- The gap in detectability suggests current SASV systems remain vulnerable to synthesis methods that explicitly optimize for naturalness and alignment.
- The same EMA-plus-dual-scoring pipeline could be tested on other noisy TTS benchmarks to measure gains in speaker consistency.
Load-bearing premise
Dual-scoring with LLMs and LALMs successfully filters reference audio and text prompts to ensure quality and address alignment issues in noisy datasets.
What would settle it
A re-evaluation of the submitted audio on the same three SASV systems in which any other entry records better a-DCF scores on all three detectors would falsify the ranking of undetectability.
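The test described above can be written as a simple dominance check. In this sketch the three reported scores come from the abstract, while the competing entries are invented placeholders. The direction in which a-DCF counts as "better" for a TTS entry is left as an explicit flag, because the abstract only calls its scores "best".

```python
def dominated_by(ours, other, higher_is_better):
    """True if `other` is strictly better than `ours` on every detector."""
    if higher_is_better:
        return all(o > u for u, o in zip(ours, other))
    return all(o < u for u, o in zip(ours, other))


def ranking_survives(ours, others, higher_is_better):
    """True if no competing entry beats `ours` on all detectors at once."""
    return not any(
        dominated_by(ours, scores, higher_is_better)
        for scores in others.values()
    )


reported = (0.1582, 0.5233, 0.2562)  # a-DCF values quoted in the abstract
hypothetical_entries = {             # placeholder numbers, not challenge data
    "entry_x": (0.10, 0.60, 0.20),
    "entry_y": (0.12, 0.40, 0.30),
}
print(ranking_survives(reported, hypothetical_entries, higher_is_better=True))
```

Note that this encodes the condition exactly as stated, a rival winning on all three systems; a claim of being best on each system individually would already fail if a rival won on any single one.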
Original abstract
In this technical report, we describe our submission for the WildSpoof Challenge TTS Track: Text-to-Speech with In-the-Wild Data. We introduce F5-TTS-DPS, a model built upon the F5-TTS architecture. Our approach integrates Exponential Moving Average (EMA) into supervised fine-tuning to stabilize training and improve generalization. To enhance synthesis fidelity, we leverage large language models (LLMs) and large audio language models (LALMs) for dual-scoring prompt selection, filtering reference audio and text prompts to ensure quality while addressing alignment issues in noisy datasets. Experimental evaluation demonstrates that F5-TTS-DPS achieves strong performance with UTMOS of 3.20 and speaker similarity of 0.51 on the development set. More importantly, our model achieves the best a-DCF scores of 0.1582, 0.5233, and 0.2562 across three advanced SASV systems among all submissions, indicating our synthesized speech is the most difficult to detect and exhibits the highest degree of naturalness and authenticity. Combined with competitive WER performance, these results validate the effectiveness of our approach in generating natural-sounding speech with strong spoofing capabilities.
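For orientation, the a-DCF metric named in the abstract is usually written as a weighted sum of three error rates of the SASV system at a decision threshold. The form below is recalled from the a-DCF paper that the report cites (arXiv:2403.01355), not taken from the report; the cost and prior values used by the challenge are not given here, and the symbols are that paper's notation as best remembered.

```latex
\begin{equation}
\text{a-DCF}(\tau) =
    C_{\text{miss}}\,\pi_{\text{tar}}\,P_{\text{miss}}(\tau)
  + C_{\text{fa,non}}\,\pi_{\text{non}}\,P_{\text{fa,non}}(\tau)
  + C_{\text{fa,spf}}\,\pi_{\text{spf}}\,P_{\text{fa,spf}}(\tau)
\end{equation}
```

Here $P_{\text{miss}}$ is the rate of rejected target trials, and $P_{\text{fa,non}}$ and $P_{\text{fa,spf}}$ are the rates of accepted non-target and spoofed trials. Under this reading the quantity is a cost for the detector, so synthesized speech that is accepted more often raises the last term.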
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a technical report describing F5-TTS-DPS, an extension of the F5-TTS architecture for the WildSpoof 2026 TTS Track. It integrates Exponential Moving Average (EMA) into supervised fine-tuning for training stability and uses LLM/LALM dual-scoring to filter reference audio and text prompts for quality and alignment in noisy data. The report claims UTMOS of 3.20 and speaker similarity of 0.51 on the development set, plus the best a-DCF scores (0.1582, 0.5233, 0.2562) across three SASV systems among all submissions, attributing this to the proposed components and concluding that the synthesized speech is the most natural and difficult to detect.
Significance. If the reported a-DCF gains can be causally linked to EMA and dual-scoring prompt selection, the work would demonstrate practical techniques for producing in-the-wild TTS that robustly challenges state-of-the-art spoofing detectors while maintaining competitive naturalness metrics. The challenge results themselves provide an external benchmark, but the current lack of supporting evidence prevents a full assessment of significance.
major comments (1)
- [Abstract] The central claim that EMA during fine-tuning plus LLM/LALM dual-scoring prompt selection produces the best a-DCF scores (0.1582/0.5233/0.2562) and 'the most difficult to detect' speech is load-bearing for the paper's contribution. Yet the manuscript provides no ablation studies, no baseline comparison against plain F5-TTS, and no single-component variants, and it gives no experimental protocol, statistical tests, or details on how the development-set metrics and challenge submissions were obtained.
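The missing ablation amounts to a two-by-two grid over the two proposed components. As a rough sketch of what would need to be trained and scored, the snippet below enumerates the variants; the component labels and the naming scheme are placeholders, and no results are implied.

```python
from itertools import product

COMPONENTS = ("ema", "dual_scoring")  # the two techniques named in the report


def ablation_grid(components=COMPONENTS):
    """Yield one config dict per on/off combination of the components."""
    for flags in product((False, True), repeat=len(components)):
        yield dict(zip(components, flags))


def variant_name(config):
    """Human-readable label for one ablation variant."""
    enabled = [name for name, on in config.items() if on]
    if not enabled:
        return "F5-TTS (no added components)"
    return "F5-TTS + " + " + ".join(enabled)


for config in ablation_grid():
    print(variant_name(config))
```

Scoring all four variants with the same metrics would separate the contribution of each component from that of the base model.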
minor comments (1)
- [Abstract] The phrase 'competitive WER performance' is stated without a numerical value or comparison, which makes overall system quality harder to judge.
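For readers unfamiliar with the metric in question, WER is the word-level edit distance between a transcript of the synthesized audio and the reference text, divided by the reference length. The sketch below is the textbook computation on whitespace-split words; it makes no claim about the ASR model or text normalization used in the challenge, which are not described here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Reporting this number next to a baseline's value is what the comment above asks for.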
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report for the WildSpoof 2026 TTS Track. We address the major comment below.
- Referee: [Abstract] The central claim that EMA during fine-tuning plus LLM/LALM dual-scoring prompt selection produces the best a-DCF scores (0.1582/0.5233/0.2562) and 'the most difficult to detect' speech is load-bearing for the paper's contribution. Yet the manuscript provides no ablation studies, no baseline comparison against plain F5-TTS, and no single-component variants, and it gives no experimental protocol, statistical tests, or details on how the development-set metrics and challenge submissions were obtained.
Authors: We agree the manuscript lacks ablation studies, comparisons to base F5-TTS, single-component variants, statistical tests, and full experimental protocol details. As a concise technical report for a challenge submission, the emphasis is on describing the system and reporting results that were externally validated through the challenge leaderboard against other entries. The development-set metrics (UTMOS 3.20, speaker similarity 0.51) and challenge a-DCF scores provide supporting context, with the latter serving as an independent benchmark. We did not perform the requested internal experiments. We will revise to expand the experimental protocol section with details on training, prompt selection, metric computation, and submission generation.
Revision: partial
- Ablation studies, baseline comparisons against plain F5-TTS, single-component variants, and statistical tests, as these experiments were not conducted.
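Of the items listed above, the statistical test is the cheapest to add once per-utterance scores exist for two systems. One common option, offered here only as a sketch, is a paired bootstrap confidence interval on the mean score difference. The score lists below are made-up stand-ins for per-utterance values such as UTMOS, and the function name and defaults are illustrative.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(scores_a - scores_b) over paired utterances."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample utterances with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-utterance scores for two systems on the same five utterances.
system_a = [3.4, 3.1, 3.3, 3.5, 3.2]
system_b = [3.0, 2.9, 3.0, 3.1, 2.8]
low, high = paired_bootstrap_ci(system_a, system_b)
print(low, high)
```

An interval that excludes zero would indicate a difference unlikely to be explained by utterance sampling alone. Corpus-level metrics such as a-DCF would need resampling at the trial level instead.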
Circularity Check
Empirical challenge submission with external benchmark scores; no derivation or self-referential reduction
The paper is a technical report describing an empirical TTS system (F5-TTS-DPS) trained with EMA and dual-scoring prompt selection, evaluated on the external WildSpoof challenge. All reported metrics (a-DCF, UTMOS, speaker similarity, WER) are direct outcomes on held-out challenge data. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. The central claim attributes performance to the two techniques but does not reduce any quantity to a definition or fit internal to the paper; the absence of ablations is a limitation of evidence strength, not circularity. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Dual-scoring with LLMs and LALMs filters reference prompts to ensure quality and resolve alignment issues in noisy datasets.
- domain assumption: EMA during supervised fine-tuning stabilizes training and improves generalization for in-the-wild TTS.
Prompt excerpts from the paper's appendix (truncated in the source)
- Emotional Richness (4 points): clear emotional expression, dynamic range, engaging tone
- Voice Expressiveness (3 points): varied intonation, natural emphasis, compelling delivery
- Prompt Suitability (3 points): distinctive characteristics, memorable voice, good reference quality
Focus on identifying audio with:
- Strong emotional expression and personality
- Natural variations in pitch, pace, and intensity
- Engaging and distinctive vocal characteristics
- Clear demonstration of target speaking style
Scoring Guidelines:
- 9-10: Hi...
Please evaluate the following audio and its corresponding text:
A.2
- **Prosodic Alignment** (0-10): Rhythm patterns, stress distribution, intonation flow, syllable timing
- **Emotional Congruence** (0-10): Emotional intensity, sentiment polarity, expressive quality
- **Linguistic Compatibility** (0-10): Sentence structure, phrase boundaries, syntactic complexity
- **TTS Reference Suitability** (0-10): Overall effectiveness as prosodic template
**Input:**
- **Target Text**: {text}
- **Reference Candidates**: {reference}
**Output Format:** Output the selected reference sentence in <answer> </answer> directly <no-think>
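The last prompt excerpt asks the model to wrap its chosen reference sentence in `<answer>` tags. A small parser for that output format could look like the sketch below; the function name and the choice to treat a missing or empty tag as `None` are assumptions, since the report does not describe how responses were post-processed.

```python
import re

# Non-greedy match so only the first <answer>...</answer> pair is captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(model_output):
    """Return the text inside the first <answer>...</answer> pair, or None."""
    match = ANSWER_RE.search(model_output)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


print(extract_answer("Reasoning omitted. <answer> Hello there. </answer>"))
```

Returning `None` rather than raising lets a caller fall back to a default reference when the model ignores the format.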