pith. sign in

Title resolution pending

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it
abstract

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4\% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.

fields

eess.AS 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

eess.AS · 2026-05-20 · unverdicted · novelty 5.0

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

citing papers explorer

Showing 1 of 1 citing paper.

  • Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech eess.AS · 2026-05-20 · unverdicted · none · ref 17 · internal anchor

    Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.