HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
hub
Vibevoice technical report
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.
CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
F3-Tokenizer adapts audio autoencoder latents with noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both high-dimensional understanding representations and normalized continuous generation ta
UNIQUE enables efficient top-k sparse attention in LLMs by using a mean-plus-std page importance score and a soft-mask training approach, achieving up to 11.4x kernel speedup while preserving performance.
PilotTTS achieves lowest WER 1.50% (en) and CER 0.87% (zh) plus highest speaker similarity on Seed-TTS Eval using a Q-Former conditioned autoregressive architecture and a released multi-stage open data pipeline.
citing papers explorer
No citing papers match the current filters.