ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
4 Pith papers cite this work.
fields
eess.AS 4
representative citing papers
LLM-based multi-talker ASR with a dual encoder, feature interleaving, a length-aware speaker loss, and an adaptive ASR threshold achieves 18% and 24% relative gains over baselines on AliMeeting and AISHELL-4.
SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.
A survey of spatial speech perception systems covering sound source localization, directional enhancement, and automatic speech recognition methods and their integration.
citing papers explorer
- Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition
  LLM-based multi-talker ASR with a dual encoder, feature interleaving, a length-aware speaker loss, and an adaptive ASR threshold achieves 18% and 24% relative gains over baselines on AliMeeting and AISHELL-4.
- SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription
  SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.
- Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition
  A survey of spatial speech perception systems covering sound source localization, directional enhancement, and automatic speech recognition methods and their integration.