pith. sign in

Minimax- speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

citation-role summary

dataset 1

citation-polarity summary

years

2026 6 2025 2

verdicts

UNVERDICTED 8

roles

dataset 1

polarities

use dataset 1

representative citing papers

Qwen3-TTS Technical Report

cs.SD · 2026-01-22 · unverdicted · novelty 6.0

Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

Voxtral TTS

cs.AI · 2026-03-26 · unverdicted · novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.

citing papers explorer

Showing 8 of 8 citing papers.

  • Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling cs.CV · 2026-04-26 · unverdicted · none · ref 29

    Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

  • MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control cs.SD · 2026-04-23 · unverdicted · none · ref 4

    MAGIC-TTS is the first TTS system with explicit token-level duration and pause control that improves timing accuracy while preserving natural quality when controls are absent.

  • RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching cs.SD · 2026-05-21 · unverdicted · none · ref 31

    RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.

  • OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models cs.CL · 2026-04-01 · unverdicted · none · ref 51

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.

  • Qwen3-TTS Technical Report cs.SD · 2026-01-22 · unverdicted · none · ref 25

    Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.

  • Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 33

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

  • Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 81

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

  • Voxtral TTS cs.AI · 2026-03-26 · unverdicted · none · ref 20

    Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.