Minimax- speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder

· 2025 · arXiv 2505.07916

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

cs.SD · 2026-04-23 · unverdicted · novelty 7.0

MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

cs.SD · 2026-05-21 · unverdicted · novelty 6.0

RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.

Qwen3-TTS Technical Report

cs.SD · 2026-01-22 · unverdicted · novelty 6.0

Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

Voxtral TTS

cs.AI · 2026-03-26 · unverdicted · novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 33
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 81
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

Minimax- speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer