3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

Hui Wang; Luyao Cheng; Qian Chen; Siqi Zheng; Yafeng Chen

arxiv: 2306.15354 · v3 · pith:4DAMEIXAnew · submitted 2023-06-27 · cs.CL · cs.SD · eess.AS

Siqi Zheng, Luyao Cheng, Yafeng Chen, Hui Wang, Qian Chen

classification cs.CL · cs.SD · eess.AS

keywords speech · 3D-Speaker · representation · corpus · different · disentanglement · information · large-scale

Disentangling uncorrelated information in speech utterances is a crucial research topic within the speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the effects of other, uncorrelated information. We present a large-scale speech corpus to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is simultaneously recorded by multiple Devices located at different Distances, and some of whom speak multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of diversely entangled speech representations, thereby motivating new methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource for evaluating large universal speech models and for experimenting with methods of out-of-domain learning and self-supervised learning. https://3dspeaker.github.io/

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
cs.SD 2026-04 unverdicted novelty 6.0

Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription
eess.AS 2026-06 unverdicted novelty 4.0

SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISH...

Kiwano: A Cutting-Edge Open-Source Toolkit for Speaker Verification
cs.SD 2026-06 unverdicted novelty 3.0

Kiwano is an open-source toolkit that supplies standardized PyTorch pipelines, pretrained models, and evaluation protocols for speaker verification.