SpeechAlign: Aligning speech generation to human preferences
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.SD (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.
citing papers explorer
- Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.
- Same Words, Different Judgments: How Preferences Vary Across Modalities
Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.