ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
Chime-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings
4 Pith papers cite this work. Polarity classification is still indexing.
fields
eess.AS 4representative citing papers
LLM-based multi-talker ASR with dual-encoder, feature interleaving, length-aware speaker loss, and adaptive ASR threshold achieves 18% and 24% relative gains over baselines on AliMeeting and Aishell4.
SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.
A survey of spatial speech perception systems covering sound source localization, directional enhancement, and automatic speech recognition methods and their integration.
citing papers explorer
-
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.