CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings

2020 · arXiv 2004.09249

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

eess.AS · 2025-07-12 · conditional · novelty 6.0

ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

eess.AS · 2026-06-11 · unverdicted · novelty 4.0

An LLM-based multi-talker ASR system with a dual encoder, feature interleaving, a length-aware speaker loss, and an adaptive ASR threshold that reports 18% and 24% relative gains over baselines on AliMeeting and AISHELL-4.

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

eess.AS · 2026-06-01 · unverdicted · novelty 4.0

SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.

Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition

eess.AS · 2026-07-02 · unverdicted · novelty 2.0

A survey of spatial speech perception systems covering sound source localization, directional enhancement, and automatic speech recognition methods and their integration.

citing papers explorer

Showing 1 of 1 citing paper after filters.

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching eess.AS · 2025-07-12 · conditional · none · ref 44
