MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Bowen Zhang; Congchao Guo; Geng Yang; Hang Yu; Haozhe Zhang; Heidi Lei; Jialong Mai; Junjie Yan; Kaiyue Yang; Mingqi Yang; Peikai Huang; Ruiyang Jin; Sitan Jiang; Weihua Cheng; Yawei Li; Yichen Xiao; Yiying Zhou; Yongmao Zhang; Yuan Lu; Yucen He

arxiv: 2505.07916 · v1 · pith:SNYC2IEVnew · submitted 2025-05-12 · 📡 eess.AS · cs.SD

keywords: voice, minimax-speech, speaker, timbre, cloning, encoder, features, model

Abstract

We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from a text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
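
The abstract names two objective voice cloning metrics, Word Error Rate and Speaker Similarity, without defining them. The sketch below shows how such metrics are commonly computed; it is not taken from the paper. It assumes WER is the word-level edit distance divided by the reference length, and that speaker similarity is the cosine similarity between two speaker embedding vectors. The function names and the toy inputs are hypothetical, and a real evaluation would obtain the transcript from an ASR system and the embeddings from a speaker verification model.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy transcript pair: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Toy embeddings standing in for speaker-verification model outputs.
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Lower WER indicates more intelligible synthesis, and a cosine similarity closer to 1 indicates that the synthesized voice is closer to the reference speaker under the chosen embedding model.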

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
cs.CV 2026-04 unverdicted novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control
cs.SD 2026-04 unverdicted novelty 7.0

MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
cs.SD 2026-06 unverdicted novelty 6.0

ScenA generates multi-speaker audio scenes by conditioning a flow-matching foundation model on reference voices and natural language prompts, using a high-noise-biased timestep schedule to prevent reference shortcut.

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis
cs.CL 2026-06 unverdicted novelty 6.0

EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-base...

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching
cs.SD 2026-05 unverdicted novelty 6.0

RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 6.0

OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

Qwen3-TTS Technical Report
cs.SD 2026-01 unverdicted novelty 6.0

Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5...

Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

VoxCPM2 Technical Report
cs.SD 2026-06 unverdicted novelty 5.0

VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.

Voxtral TTS
cs.AI 2026-03 unverdicted novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
cs.SD 2026-05 unverdicted novelty 4.0

PilotTTS achieves lowest WER 1.50% (en) and CER 0.87% (zh) plus highest speaker similarity on Seed-TTS Eval using a Q-Former conditioned autoregressive architecture and a released multi-stage open data pipeline.