MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
hub
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.
EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.
AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.
RoboKA is a KAN-based multimodal fusion model that outperforms baselines on a new synthetic dataset for detecting adversarial robocalls via acoustic and linguistic cues.
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
citing papers explorer
-
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.