arxiv: 2505.07916 · v1 · pith:SNYC2IEVnew · submitted 2025-05-12 · 📡 eess.AS · cs.SD

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

classification 📡 eess.AS cs.SD
keywords voice · minimax-speech · speaker · timbre · cloning · encoder · features · model

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
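
The abstract names two objective voice cloning metrics, Word Error Rate and Speaker Similarity, without saying on this page how they are computed. As a rough illustration only, the sketch below uses the standard definition of WER (word-level edit distance divided by reference length) and assumes that speaker similarity is the cosine similarity between two speaker embedding vectors, which is a common convention but is not confirmed here. The function names are hypothetical, and the speech recognizer and speaker embedding model that a real evaluation would need are not shown.

```python
from math import sqrt
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: the embeddings come from some speaker verification model
    applied to the reference and synthesized audio (not shown here).
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In this sketch, the hypothesis string would be the transcript an ASR system produces for the synthesized audio, so a lower WER indicates more intelligible speech, while a cosine similarity closer to 1 indicates a closer match to the reference voice.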

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...

  2. MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

    cs.SD 2026-04 unverdicted novelty 7.0

    MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.

  3. Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

    cs.SD 2026-06 unverdicted novelty 6.0

    ScenA generates multi-speaker audio scenes by conditioning a flow-matching foundation model on reference voices and natural language prompts, using a high-noise-biased timestep schedule to prevent reference shortcut.

  4. EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

    cs.CL 2026-06 unverdicted novelty 6.0

    EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-base...

  5. RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

    cs.SD 2026-05 unverdicted novelty 6.0

    RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.

  6. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  7. Qwen3-TTS Technical Report

    cs.SD 2026-01 unverdicted novelty 6.0

    Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5...

  8. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  9. Step-Audio 2 Technical Report

    cs.CL 2025-07 unverdicted novelty 6.0

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

  10. VoxCPM2 Technical Report

    cs.SD 2026-06 unverdicted novelty 5.0

    VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.

  11. Voxtral TTS

    cs.AI 2026-03 unverdicted novelty 5.0

    Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...

  12. PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

    cs.SD 2026-05 unverdicted novelty 4.0

    PilotTTS achieves the lowest WER (1.50%, en) and CER (0.87%, zh) and the highest speaker similarity on Seed-TTS Eval, using a Q-Former-conditioned autoregressive architecture and a released multi-stage open data pipeline.