To evaluate the smoothness of transitions in both emotion and speaking rate, we adopt the DNSMOS Pro (Cumlin et al., 2024), referred to as DNSM

ASR model to calculate Word Error Rate (WER), while for Chinese audio, we utilize a Paraformer (Gao et al · 2022 · arXiv 9563.4500

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A training-free framework for intra-utterance emotion and duration control in pretrained zero-shot TTS via segment-aware conditioning and steering strategies.


