hub

Indextts: An industrial-level controllable and efficient zero-shot text-to-speech system

Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang · 2025 · arXiv 2502.05512

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

eess.AS · 2025-10-22 · unverdicted · novelty 7.0

EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

eess.AS · 2026-05-16 · unverdicted · novelty 6.0

SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

eess.AS · 2025-07-12 · conditional · novelty 6.0

ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

cs.SD · 2025-05-23 · unverdicted · novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.

RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System

cs.MM · 2026-04-30 · unverdicted · novelty 5.0

RoboKA is a KAN-based multimodal fusion model that outperforms baselines on a new synthetic dataset for detecting adversarial robocalls via acoustic and linguistic cues.

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

cs.SD · 2026-04-13 · unverdicted · novelty 5.0

ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

cs.CL · 2026-03-17 · unverdicted · novelty 5.0

WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

cs.SD · 2026-04-09 · unverdicted · novelty 3.0

AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

citing papers explorer

Showing 13 of 13 citing papers.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech eess.AS · 2026-04-20 · unverdicted · none · ref 4
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing cs.SD · 2026-04-17 · unverdicted · none · ref 20
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection eess.AS · 2025-10-22 · unverdicted · none · ref 25
EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis eess.AS · 2026-05-16 · unverdicted · none · ref 11
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models cs.CL · 2026-04-01 · unverdicted · none · ref 31
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching eess.AS · 2025-07-12 · conditional · none · ref 11
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training cs.SD · 2025-05-23 · unverdicted · none · ref 32
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward模型.
AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech cs.CV · 2026-05-14 · unverdicted · none · ref 108
AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.
RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System cs.MM · 2026-04-30 · unverdicted · none · ref 15
RoboKA is a KAN-based multimodal fusion model that outperforms baselines on a new synthetic dataset for detecting adversarial robocalls via acoustic and linguistic cues.
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing cs.SD · 2026-04-13 · unverdicted · none · ref 19
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition cs.CL · 2026-04-10 · unverdicted · none · ref 36
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models cs.CL · 2026-03-17 · unverdicted · none · ref 10
WAND adapts AR-TTS models to constant complexity via windowed attention and distillation, cutting KV cache memory by up to 66.2% while preserving quality and achieving length-invariant latency.
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan cs.SD · 2026-04-09 · unverdicted · none · ref 9
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

Indextts: An industrial-level controllable and efficient zero-shot text-to-speech system

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer