arxiv: 2605.23859 · v1 · pith:S7QKYTBQnew · submitted 2026-05-22 · 📡 eess.AS

Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track

Pith reviewed 2026-05-25 02:17 UTC · model grok-4.3

classification 📡 eess.AS
keywords text-to-speech · in-the-wild data · exponential moving average · prompt selection · spoofing detection · WildSpoof challenge · naturalness

The pith

F5-TTS fine-tuned with EMA and dual LLM-LALM prompt scoring yields speech that ranks hardest to detect on three SASV systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The submission builds F5-TTS-DPS by adding exponential moving average (EMA) during supervised fine-tuning and by using large language models (LLMs) together with large audio language models (LALMs) to score and select reference prompts from noisy in-the-wild data. The dual-scoring step is meant to remove low-quality or misaligned examples so that the generated speech stays faithful to the target speaker and text. On the development set the model records UTMOS of 3.20 and speaker similarity of 0.51 while posting the lowest a-DCF values of 0.1582, 0.5233 and 0.2562 across three spoofing-aware speaker verification (SASV) systems. These numbers are presented as evidence that the output is both natural and the most difficult for current detectors to flag. The work therefore supplies a concrete recipe for turning noisy real-world recordings into high-fidelity synthetic speech that evades automated checks.

Core claim

F5-TTS-DPS integrates EMA into supervised fine-tuning to stabilize training and improve generalization; it further applies dual-scoring prompt selection with LLMs and LALMs to filter reference audio and text prompts, thereby addressing alignment problems in noisy datasets and producing output that achieves the best a-DCF scores of 0.1582, 0.5233 and 0.2562 on three advanced SASV systems together with competitive WER.
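The EMA step in this claim can be illustrated with a minimal sketch. It assumes the common shadow-weight formulation, in which a second copy of the parameters tracks the trained ones with a decay factor. The function name `ema_update`, the decay of 0.9, and the toy parameter lists are illustrative assumptions, not details taken from the report.

```python
def ema_update(shadow, params, decay=0.999):
    """Return new shadow weights: decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Toy stand-in for fine-tuning: the "trained" weights jump around 1.0,
# while the shadow copy follows them smoothly.
params = [0.0, 0.0]
shadow = list(params)
for step in range(1, 6):
    # Hypothetical noisy optimizer step (alternating overshoot and undershoot).
    params = [1.0 + (0.5 if step % 2 else -0.5) for _ in params]
    shadow = ema_update(shadow, params, decay=0.9)
```

In this formulation the shadow weights, rather than the raw ones, would be used for inference, which is what makes the averaged model less sensitive to the last few noisy updates.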

What carries the argument

Dual-scoring prompt selection that combines LLM and LALM scores to filter noisy in-the-wild reference prompts, used together with EMA during supervised fine-tuning of the F5-TTS base model.
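The page does not say how the LLM and LALM scores are combined, so the following is only a sketch of one plausible scheme: each candidate prompt must pass a minimum score on both sides, and the survivor with the highest sum is chosen. The `Candidate` type, the 0-10 score range, the thresholds, and the sum rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    prompt_id: str
    lalm_score: float  # hypothetical audio-side score from a LALM, 0-10
    llm_score: float   # hypothetical text-side score from an LLM, 0-10


def select_prompt(candidates, min_lalm=6.0, min_llm=6.0):
    """Drop candidates failing either threshold, then pick the best sum.

    Returns None when no candidate passes both thresholds.
    """
    kept = [c for c in candidates
            if c.lalm_score >= min_lalm and c.llm_score >= min_llm]
    if not kept:
        return None
    return max(kept, key=lambda c: c.lalm_score + c.llm_score)


candidates = [
    Candidate("utt_a", lalm_score=9.0, llm_score=3.0),  # good audio, poor text fit
    Candidate("utt_b", lalm_score=7.0, llm_score=8.0),
    Candidate("utt_c", lalm_score=6.5, llm_score=7.0),
]
best = select_prompt(candidates)
```

Requiring both scores to clear a threshold, rather than only ranking by their sum, keeps a prompt with excellent audio but a poor text match (such as `utt_a`) from being selected.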

If this is right

  • The synthesized speech exhibits the highest degree of naturalness and authenticity among submissions as measured by lowest a-DCF across three detectors.
  • EMA during fine-tuning stabilizes training and yields better generalization on in-the-wild data.
  • Dual-scoring prompt selection improves fidelity while maintaining competitive word-error-rate performance.
  • The combination produces audio with UTMOS 3.20 and speaker similarity 0.51 on the development set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt-quality filtering may prove more decisive than architecture changes when training TTS on noisy real-world corpora.
  • The gap in detectability suggests current SASV systems remain vulnerable to synthesis methods that explicitly optimize for naturalness and alignment.
  • The same EMA-plus-dual-scoring pipeline could be tested on other noisy TTS benchmarks to measure gains in speaker consistency.

Load-bearing premise

Dual-scoring with LLMs and LALMs successfully filters reference audio and text prompts to ensure quality and address alignment issues in noisy datasets.

What would settle it

A re-evaluation of the submitted audio on the same three SASV systems in which any other entry records lower a-DCF scores on all three detectors would falsify the ranking of undetectability.
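The falsification test above reduces to a strict comparison on every detector. A minimal sketch, using the three a-DCF values reported for F5-TTS-DPS and following the page's convention that a lower value ranks better; the rival scores are invented purely for illustration.

```python
OURS = (0.1582, 0.5233, 0.2562)  # a-DCF values reported for F5-TTS-DPS


def beats_on_all(other, ours=OURS):
    """True if `other` has a strictly lower a-DCF on every detector."""
    return all(o < u for o, u in zip(other, ours))


# Hypothetical competitor scores, not taken from any leaderboard.
rival_partial = (0.1500, 0.6000, 0.2600)  # lower on one detector only
rival_full = (0.1500, 0.5000, 0.2500)     # lower on all three
```

Under this criterion `rival_partial` would not overturn the ranking, while `rival_full` would.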

Original abstract

In this technical report, we describe our submission for the WildSpoof Challenge TTS Track: Text-to-Speech with In-the-Wild Data. We introduce F5-TTS-DPS, a model built upon the F5-TTS architecture. Our approach integrates Exponential Moving Average (EMA) into supervised fine-tuning to stabilize training and improve generalization. To enhance synthesis fidelity, we leverage large language models (LLMs) and large audio language models (LALMs) for dual-scoring prompt selection, filtering reference audio and text prompts to ensure quality while addressing alignment issues in noisy datasets. Experimental evaluation demonstrates that F5-TTS-DPS achieves strong performance with UTMOS of 3.20 and speaker similarity of 0.51 on the development set. More importantly, our model achieves the best a-DCF scores of 0.1582, 0.5233, and 0.2562 across three advanced SASV systems among all submissions, indicating our synthesized speech is the most difficult to detect and exhibits the highest degree of naturalness and authenticity. Combined with competitive WER performance, these results validate the effectiveness of our approach in generating natural-sounding speech with strong spoofing capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript is a technical report describing F5-TTS-DPS, an extension of the F5-TTS architecture for the WildSpoof 2026 TTS Track. It integrates Exponential Moving Average (EMA) into supervised fine-tuning for training stability and uses LLM/LALM dual-scoring to filter reference audio and text prompts for quality and alignment in noisy data. The report claims UTMOS of 3.20 and speaker similarity of 0.51 on the development set, plus the best a-DCF scores (0.1582, 0.5233, 0.2562) across three SASV systems among all submissions, attributing this to the proposed components and concluding that the synthesized speech is the most natural and difficult to detect.

Significance. If the reported a-DCF gains can be causally linked to EMA and dual-scoring prompt selection, the work would demonstrate practical techniques for producing in-the-wild TTS that robustly challenges state-of-the-art spoofing detectors while maintaining competitive naturalness metrics. The challenge results themselves provide an external benchmark, but the current lack of supporting evidence prevents a full assessment of significance.

major comments (1)
  1. [Abstract] The central claim that EMA during fine-tuning plus LLM/LALM dual-scoring prompt selection produces the lowest a-DCF scores (0.1582/0.5233/0.2562) and 'the most difficult to detect' speech is load-bearing for the paper's contribution, yet the manuscript provides no ablation studies, baseline comparisons against plain F5-TTS, or single-component variants, nor any experimental protocol, statistical tests, or details on how the development-set metrics and challenge submissions were obtained.
minor comments (1)
  1. [Abstract] The phrase 'competitive WER performance' is stated without a numerical value or comparison, reducing clarity on overall system quality.
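The ablations requested in the major comment amount to a small on/off grid over the two proposed components. A minimal sketch of enumerating that grid follows; the component names are hypothetical labels, and no training or evaluation is performed here.

```python
from itertools import product

COMPONENTS = ("ema", "dual_scoring")


def ablation_grid(components=COMPONENTS):
    """Enumerate every on/off combination of the listed components."""
    return [dict(zip(components, flags))
            for flags in product((False, True), repeat=len(components))]


grid = ablation_grid()
for config in grid:
    # A configuration with nothing enabled corresponds to the plain baseline.
    enabled = [name for name, on in config.items() if on] or ["plain F5-TTS"]
    print(" + ".join(enabled))
```

The four resulting configurations (baseline, each component alone, and both together) are the minimum needed to attribute any a-DCF change to one component rather than the other.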

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our technical report for the WildSpoof 2026 TTS Track. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that EMA during fine-tuning plus LLM/LALM dual-scoring prompt selection produces the lowest a-DCF scores (0.1582/0.5233/0.2562) and 'the most difficult to detect' speech is load-bearing for the paper's contribution, yet the manuscript provides no ablation studies, baseline comparisons against plain F5-TTS, or single-component variants, nor any experimental protocol, statistical tests, or details on how the development-set metrics and challenge submissions were obtained.

    Authors: We agree the manuscript lacks ablation studies, comparisons to base F5-TTS, single-component variants, statistical tests, and full experimental protocol details. As a concise technical report for a challenge submission, the emphasis is on describing the system and reporting results that were externally validated through the challenge leaderboard against other entries. The development-set metrics (UTMOS 3.20, speaker similarity 0.51) and challenge a-DCF scores provide supporting context, with the latter serving as an independent benchmark. We did not perform the requested internal experiments. We will revise to expand the experimental protocol section with details on training, prompt selection, metric computation, and submission generation.

    revision: partial

standing simulated objections not resolved
  • Ablation studies, baseline comparisons against plain F5-TTS, single-component variants, and statistical tests, as these experiments were not conducted.

Circularity Check

0 steps flagged

Empirical challenge submission with external benchmark scores; no derivation or self-referential reduction

Full rationale

The paper is a technical report describing an empirical TTS system (F5-TTS-DPS) trained with EMA and dual-scoring prompt selection, evaluated on the external WildSpoof challenge. All reported metrics (a-DCF, UTMOS, speaker similarity, WER) are direct outcomes on held-out challenge data. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. The central claim attributes performance to the two techniques but does not reduce any quantity to a definition or fit internal to the paper; the absence of ablations is a limitation of evidence strength, not circularity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested premise that LLM/LALM dual scoring reliably improves prompt quality for in-the-wild TTS without introducing selection bias, plus the assumption that EMA stabilizes fine-tuning on noisy data in a way that directly translates to lower a-DCF.

axioms (2)
  • domain assumption: Dual-scoring with LLMs and LALMs filters reference prompts to ensure quality and resolve alignment issues in noisy datasets.
    Invoked in the abstract as the mechanism for synthesis fidelity.
  • domain assumption: EMA during supervised fine-tuning stabilizes training and improves generalization for in-the-wild TTS.
    Stated as the first technical addition in the abstract.
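
For the second axiom, the stabilizing effect is easiest to see from the usual form of the update. The notation below is a standard convention assumed here, not taken from the report: \theta_t are the trained weights at step t, \bar{\theta}_t the averaged weights, and \beta the decay.

```latex
\bar{\theta}_t = \beta\,\bar{\theta}_{t-1} + (1 - \beta)\,\theta_t, \qquad 0 \le \beta < 1

% Unrolling the recursion from step 0:
\bar{\theta}_t = \beta^{t}\,\bar{\theta}_0 + (1 - \beta)\sum_{k=1}^{t} \beta^{\,t-k}\,\theta_k
```

The unrolled form shows that each past iterate contributes with a geometrically decaying weight, so a single noisy update moves the averaged weights by at most a factor of (1 - \beta) of its size. Whether that smoothing translates into lower a-DCF is exactly what the ledger marks as assumed rather than tested.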

pith-pipeline@v0.9.0 · 5771 in / 1513 out tokens · 26150 ms · 2026-05-25T02:17:17.185160+00:00 · methodology

