Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization
Pith reviewed 2026-05-19 14:34 UTC · model grok-4.3
The pith
Synthetic conversational data approaches real-data baselines and mixing both yields substantial gains for multi-talker ASR and speaker diarization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Broad source diversity consistently outperforms exact domain matching. Synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.
What carries the argument
FastMSS, a highly efficient open-source simulator for generating synthetic multi-speaker mixtures, used to analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies.
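As a rough illustration of what a turn-taking simulator has to decide, the sketch below places utterances on a timeline using signed gaps (a negative gap means the next utterance starts before the previous one ends) and measures the resulting overlap. This is not FastMSS code: `layout_turns`, `overlap_fraction`, and the idea of passing gaps explicitly are hypothetical choices made for illustration, and consecutive utterances are assumed to come from different speakers.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def layout_turns(durations: List[float], gaps: List[float]) -> List[Segment]:
    """Place utterances on a timeline (hypothetical helper, not FastMSS API).

    gaps[i] is the pause between utterance i and utterance i + 1.
    A negative gap makes utterance i + 1 start before utterance i ends.
    """
    if len(gaps) != len(durations) - 1:
        raise ValueError("need exactly one gap between consecutive utterances")
    segments: List[Segment] = []
    start = 0.0
    for i, dur in enumerate(durations):
        segments.append((start, start + dur))
        if i < len(gaps):
            # The next start is measured from the end of the current
            # utterance and clamped so it never falls before time zero.
            start = max(0.0, start + dur + gaps[i])
    return segments


def overlap_fraction(segments: List[Segment]) -> float:
    """Share of speech time during which two or more utterances are active."""
    events = []
    for seg_start, seg_end in segments:
        events.append((seg_start, 1))
        events.append((seg_end, -1))
    events.sort()  # at equal times, ends (-1) are processed before starts (+1)
    active, prev_t = 0, 0.0
    speech, overlap = 0.0, 0.0
    for t, delta in events:
        if active >= 1:
            speech += t - prev_t
        if active >= 2:
            overlap += t - prev_t
        active += delta
        prev_t = t
    return overlap / speech if speech > 0 else 0.0
```

Sweeping the gap distribution toward more negative values raises the overlap fraction, which is the knob the core claim says helps ASR and hurts diarization.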
If this is right
- Increasing speech overlap improves multi-talker ASR but degrades speaker diarization performance.
- Broad source diversity for simulation works better than exact domain matching.
- Synthetic-only training nearly matches real-data baselines for both tasks.
- Combining synthetic and real data produces clear gains over real-only training on ASR and diarization.
Where Pith is reading between the lines
- Simulation parameters may need separate tuning for recognition versus diarization systems rather than a single recipe.
- Wider use of such simulators could reduce dependence on scarce real conversational recordings for system development.
- The task-dependent findings could guide simulation design for related audio processing problems like noise-robust speech separation.
Load-bearing premise
The simulation choices and acoustic augmentations in FastMSS produce mixtures whose statistical properties are close enough to real conversational recordings that performance trends will transfer.
What would settle it
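A test set of real multi-talker recordings on which models trained on the best synthetic mixtures fall clearly short of models trained only on real data, or on which adding synthetic data to real data fails to improve on real-only training, would call the main claims into question.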
Original abstract
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FastMSS, an efficient open-source simulator for synthetic conversational mixtures, and systematically studies the impact of simulation choices (turn-taking/overlap, source domain diversity, acoustic augmentations, and mixing strategies) on multi-talker ASR using DiCoW and speaker diarization using Sortformer. Key claims are that optimal recipes are task-dependent (overlap helps ASR but hurts diarization), broad source diversity outperforms exact domain matching, synthetic-only training approaches real-data baselines, and mixing synthetic with real data produces substantial gains over real-only training for both tasks.
Significance. If the reported trends hold after controlling for confounds, the work offers practical guidance on synthetic data generation for conversational speech tasks where real recordings are scarce. Strengths include the open-source release of FastMSS and the explicit comparison of task-specific simulation effects; these could help researchers prioritize broad diversity and overlap tuning when augmenting training sets for MT-ASR and diarization systems.
major comments (2)
- The central claim that mixing synthetic data with real recordings yields substantial gains (and that synthetic-only approaches real baselines) requires that performance differences arise from the statistical properties of FastMSS mixtures rather than simply increased total training data volume. The manuscript does not describe volume-matched real-only baselines or equivalent augmentations applied to the real data; without such controls the observed gains cannot be unambiguously attributed to the simulation choices.
- No error bars, statistical significance tests, or details on the number of experimental runs are reported for the performance trends summarized in the abstract. This makes it difficult to assess whether the claimed 'substantial gains' and task-dependent effects are reliable or could be explained by run-to-run variability.
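The volume control requested in the first major comment can be sketched as follows: draw a random subset of a training pool whose total duration stays within a fixed hour budget, so that two conditions are compared at equal volume. This is a minimal sketch under stated assumptions, not the authors' procedure; `subsample_to_hours` and the `(recording_id, duration_hours)` representation are hypothetical.

```python
import random
from typing import List, Tuple

Recording = Tuple[str, float]  # (recording_id, duration_hours), hypothetical format


def subsample_to_hours(
    recordings: List[Recording], target_hours: float, seed: int = 0
) -> List[Recording]:
    """Randomly pick recordings without exceeding target_hours in total."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    shuffled = list(recordings)
    rng.shuffle(shuffled)
    picked: List[Recording] = []
    total = 0.0
    for rec_id, hours in shuffled:
        if total + hours > target_hours:
            continue  # skip anything that would overshoot the budget
        picked.append((rec_id, hours))
        total += hours
    return picked
```

Applying the same budget to the real-only pool and to the mixed real-plus-synthetic pool would separate the effect of data composition from the effect of simply training on more hours.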
minor comments (1)
- The description of FastMSS could include a brief pseudocode or parameter table to clarify how turn-taking, source selection, and augmentations are implemented, aiding reproducibility even though the code is open-source.
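One small piece of such pseudocode might look like the following. The paper is quoted elsewhere on this page as drawing the overlap extent as a ratio from a truncated exponential; the sketch below samples such a ratio by inverting the CDF. The support `[0, upper]`, the `rate` parameter, and the function name are assumptions for illustration and are not FastMSS defaults or identifiers.

```python
import math
import random
from typing import Optional


def sample_overlap_ratio(
    rate: float, upper: float = 1.0, rng: Optional[random.Random] = None
) -> float:
    """Draw from an exponential(rate) distribution truncated to [0, upper).

    Uses inverse-CDF sampling:
    F(x) = (1 - exp(-rate * x)) / (1 - exp(-rate * upper)).
    """
    if rate <= 0 or upper <= 0:
        raise ValueError("rate and upper must be positive")
    rng = rng or random.Random()
    u = rng.random()  # uniform in [0, 1)
    scale = 1.0 - math.exp(-rate * upper)
    return -math.log(1.0 - u * scale) / rate
```

A larger `rate` concentrates the ratio near zero (short overlaps); a smaller one spreads mass toward `upper` (long overlaps).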
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the work.
Point-by-point responses
-
Referee: The central claim that mixing synthetic data with real recordings yields substantial gains (and that synthetic-only approaches real baselines) requires that performance differences arise from the statistical properties of FastMSS mixtures rather than simply increased total training data volume. The manuscript does not describe volume-matched real-only baselines or equivalent augmentations applied to the real data; without such controls the observed gains cannot be unambiguously attributed to the simulation choices.
Authors: We agree that this is an important control to isolate the contribution of FastMSS simulation choices. The current experiments compare synthetic-only, real-only, and mixed conditions using the full available real data volume without explicit volume matching or equivalent real-data augmentations. In the revised manuscript we will add volume-matched real-only baselines (by subsampling or applying comparable augmentations to the real data to equalize total training hours) and report the corresponding results. This will allow us to attribute performance differences more clearly to the statistical properties of the synthetic mixtures.
Revision: yes
-
Referee: No error bars, statistical significance tests, or details on the number of experimental runs are reported for the performance trends summarized in the abstract. This makes it difficult to assess whether the claimed 'substantial gains' and task-dependent effects are reliable or could be explained by run-to-run variability.
Authors: We acknowledge that reporting variability and statistical significance would improve confidence in the reported trends. The experiments presented were performed as single runs per configuration, primarily due to the substantial computational cost of training DiCoW and Sortformer models. In the revision we will rerun the key configurations (synthetic-only, real-only, and mixed) across multiple random seeds, report error bars or standard deviations, and include statistical significance tests for the main claims regarding substantial gains and task-dependent effects.
Revision: yes
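The multi-seed reporting promised in the second response could take a form like the sketch below: a mean and standard deviation per configuration, plus an exact sign-flip permutation test on paired differences. The choice of test and the pairing of runs by seed are assumptions made here for illustration; the rebuttal does not say which test the authors will use, and the function names are hypothetical.

```python
import itertools
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over runs (needs at least 2 runs)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences.

    a[i] and b[i] are assumed to be the same metric (e.g. an error rate)
    from two configurations trained with the same seed i.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits, total = 0, 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / total
```

With n paired runs the smallest two-sided p-value this test can return is 2 / 2^n, so five seeds bottom out at 0.0625; that bound is worth keeping in mind when deciding how many seeds to rerun.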
Circularity Check
Empirical comparisons independent of fitted parameters or load-bearing self-citations
Full rationale
The paper conducts an empirical study comparing synthetic training data generated via the introduced FastMSS simulator against held-out real-data baselines for MT-ASR and speaker diarization. No equations, derivations, or fitted parameters are described that would reduce the reported gains (synthetic-only approaching real baselines, or mixing yielding substantial improvements) to quantities defined by the same inputs. Simulation choices and augmentations are presented as experimental variables whose effects are measured externally, with no self-citation chains or uniqueness theorems invoked to justify core claims. This yields a minor score for possible incidental self-citations on baseline systems but keeps the central results self-contained against real recordings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic mixtures generated by FastMSS and chosen augmentations capture the relevant acoustic and turn-taking statistics of real multi-talker conversations
invented entities (1)
- FastMSS simulator: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
FastMSS Turn-Taking (TT) model simply extends the two-speaker HMM-based approach... Four utterance transition types... overlap extent (IR) is drawn as a ratio from a truncated exponential
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
optimal simulation recipes are task-dependent: boosting overlap improves MT-ASR but degrades diarization
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction Multi-talker conversational speech processing is undergoing a rapid transformation, driven largely by the shift from highly specialized pipelines to less data-hungry methods built on pretrained foundation models [1–4]. By leveraging massive amounts of single-speaker or self-supervised data, these foundational backbones can be effectively fi...
-
[2]
Multi-Talker Speech Processing 2.1. Multi-Talker ASR: DiCoW Multi-talker ASR (MT-ASR) has traditionally been tackled through modular separation-based pipelines [32], end-to-end architectures like Serialized Output Training (SOT) [33], or target-speaker conditioning [34]. The latter paradigm has seen rapid advancement with the introduction of diarization c...
-
[3]
Multi-Speaker Conversation Simulation To enable controlled and fast experimentation along the axes described above, we developed FastMSS, an open-source multi-speaker conversation simulator focused on scalable generation with native Lhotse [46] integration. Given a set of single-speaker utterances from a source dataset, FastMSS generates multi-talker mi...
-
[4]
Experimental Setup 4.1. Datasets As source domains for synthetic generation, we use: LibriSpeech [49] (read speech, 960h), VoxPopuli [50] (semi-spontaneous parliamentary speech, 543h), otoSpeech [51] (full-duplex conversational speech, 141h), and the close-talk channels of AMI Meeting Corpus [11] and NOTSOFAR-1 (NSF-1) [12] (spontaneous meetings). All datasets were re-aligned using the Montreal Forced Aligner [52] to ensure consistent word-level timestamps. Noises for data augmentation are taken from the MUSAN [53], with “speech” noises excluded. For DiCoW, we evaluate primarily on AMI Single Distant Microphone (SDM) and NSF-1 [12] Single-Channel (SC), alongside Libri...
-
[6]
Results 5.1. Impact of Turn-Taking Dynamics In Table 1, we isolate the effect of turn-taking by varying only the simulator transition model parameters while keeping all other factors fixed. For DiCoW, the source utterances are NSF-1 close-talk (∼500h simulated from ∼7.5h); for Sortformer, LibriSpeech (2,000h simulated from 960h), without augmentation. Fo...
-
[7]
Conclusions We presented a systematic study of synthetic conversational data for multi-talker speech processing, investigating the impact of turn-taking dynamics, source domain, and data combination strategies on target-speaker ASR (DiCoW) and speaker diarization (Sortformer). Our main findings are fourfold: (i) optimal simulation recipes are task-depen...
-
[8]
Acknowledgements This work was partially conducted at the 2025 JSALT workshop. Support was provided by the Ministry of Education, Youth and Sports of the Czech Republic (MoE) through the OP JAK project “Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications” (ID:CZ.02.01.01/00/23 020/0008518), and Br...
-
[9]
Generative AI Use Disclosure Generative AI tools have only been used to help revise the manuscript
-
[10]
A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518
-
[11]
S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
-
[12]
V. Pratap et al., “Scaling speech technology to 1,000+ languages,” J. Mach. Learn. Res., vol. 25, no. 1, Jan. 2024
-
[13]
Y. Zhang et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” 2023. [Online]. Available: https://arxiv.org/abs/2303.01037
-
[14]
Z. Huang, D. Raj, P. García, and S. Khudanpur, “Adapting self-supervised models to multi-talker speech recognition using speaker embeddings,” in Proc. of ICASSP, 2023, pp. 1–5
-
[15]
C. Li et al., “Adapting multi-lingual ASR models for handling multiple talkers,” in Proc. of Interspeech, 2023, pp. 1314–1318
-
[16]
J. Han et al., “Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization,” in Proc. of Interspeech, 2025, pp. 1583–1587
-
[17]
T. Park et al., “Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems,” in International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=AyYjRvrbDx
-
[18]
B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota, “Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024,...
-
[19]
A. Défossez et al., “Moshi: a speech-text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037
-
[20]
I. McCowan et al., “The AMI meeting corpus,” Int’l. Conf. on Methods and Techniques in Behavioral Research, 01 2005
-
[21]
A. Vinnikov et al., “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proc. of Interspeech, 2024, pp. 5003–5007
-
[22]
I. Abramovski et al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,” Computer Speech & Language, vol. 93, p. 101796, 2025
-
[23]
T.-B. Nguyen et al., “A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,” 2026. [Online]. Available: https://arxiv.org/abs/2510.23276
-
[24]
S. Watanabe et al., “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. of CHiME, 2020, pp. 1–7
-
[25]
J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Interspeech 2018, 2018, pp. 1561–1565
-
[26]
T. Cord-Landwehr, T. von Neumann, C. Boeddeker, and R. Haeb-Umbach, “MMS-MSG: A multi-purpose multi-speaker mixture signal generator,” in International Workshop on Acoustic Signal Enhancement, 2022, pp. 1–5
-
[27]
T. J. Park et al., “Property-aware multi-speaker data simulation: A probabilistic modelling technique for synthetic data generation,” in Proc. of CHiME, 2023, pp. 82–86
-
[28]
S. Cornell, J. Darefsky, Z. Duan, and S. Watanabe, “Generating data with text-to-speech and large-language models for conversational speech recognition,” in Proc. SynData4GenAI, 2024, pp. 6–10
-
[29]
S. Burdisso et al., “SDialog: A Python toolkit for end-to-end agent building, user simulation, dialog generation, and evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.10622
-
[30]
M. Le et al., “Voicebox: text-guided multilingual universal speech generation at scale,” in International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY, USA: Curran Associates Inc., 2023
-
[31]
F. Landini, A. Lozano-Diez, M. Diez, and L. Burget, “From simulated mixtures to simulated conversations as training data for end-to-end neural diarization,” in Proc. of Interspeech, 2022, pp. 5095–5099
-
[32]
S. J. Broughton and L. Samarakoon, “Pushing the limits of end-to-end diarization,” in Proc. of Interspeech, 2025, pp. 5218–5222
-
[33]
M. Yang et al., “Simulating realistic speech overlaps improves multi-talker ASR,” in Proc. of ICASSP, 2023, pp. 1–5
-
[34]
T.-B. Nguyen and A. Waibel, “Synthetic conversations improve multi-talker ASR,” in Proc. of ICASSP, 2024, pp. 10461–10465
-
[35]
B. Bamfo Odoom et al., “Can synthetic speech improve end-to-end conversational speech translation?” in Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), R. Knowles, A. Eriguchi, and S. Goel, Eds. Chicago, USA: Association for Machine Translation in the Americas, Sep. 2024, pp. 167–177....
-
[36]
B. Hilmes, N. Rossenbach, and R. Schlüter, “On the effect of purely synthetic training data for different automatic speech recognition architectures,” in Proc. of SynData4GenAI, 2024, pp. 46–50
-
[37]
Z. Chen et al., “Continuous speech separation: Dataset and analysis,” in Proc. of ICASSP, 2020, pp. 7284–7288
-
[38]
J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” 2020. [Online]. Available: https://arxiv.org/abs/2005.11262
-
[39]
Y. Fujita et al., “End-to-end neural speaker diarization with self-attention,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 296–303
-
[40]
A. Polok et al., “DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,” Computer Speech & Language, vol. 95, p. 101841, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S088523082500066X
-
[41]
S. Niu et al., “The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,” in Proc. of CHiME, 2024, pp. 31–36
-
[42]
N. Kanda, Y. Gaur, X. Wang, Z. Meng, and T. Yoshioka, “Serialized output training for end-to-end overlapped speech recognition,” in Proc. of Interspeech, 2020, pp. 2797–2801
-
[43]
N. Kanda et al., “Auxiliary interference speaker loss for target-speaker speech recognition,” in Proc. of Interspeech, 2019, pp. 236–240
-
[44]
A. Polok et al., “Target speaker ASR with Whisper,” in Proc. of ICASSP, 2025, pp. 1–5
-
[45]
——, “SE-DiCoW: Self-enrolled diarization-conditioned Whisper,” in Proc. of ICASSP, 2026
-
[46]
H. Bredin et al., “Pyannote.audio: Neural building blocks for speaker diarization,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7124–7128
-
[47]
K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in Proc. of ICASSP, 2021, pp. 7198–7202
-
[48]
H. Bredin and A. Laurent, “End-to-end speaker segmentation for overlap-aware resegmentation,” in Interspeech 2021, 2021, pp. 3111–3115
-
[49]
A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. of Interspeech, 2023, pp. 3222–3226
-
[50]
I. Medennikov et al., “Streaming Sortformer: Speaker cache-based online speaker diarization with arrival-time ordering,” in Proc. of Interspeech, 2025, pp. 5238–5242
-
[51]
S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and P. García, “Encoder-decoder based attractors for end-to-end neural diarization,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, pp. 1493–1507, Mar. 2022. [Online]. Available: https://doi.org/10.1109/TASLP.2022.3162080
-
[52]
S. Horiguchi, S. Watanabe, P. García, Y. Takashima, and Y. Kawaguchi, “Online neural diarization of unlimited numbers of speakers using global and local attractors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 706–720, 2022
-
[53]
H. Huang et al., “NEST: Self-supervised Fast Conformer as all-purpose seasoning to speech processing tasks,” in Proc. of ICASSP, 2025, pp. 1–5
-
[54]
D. Rekesh et al., “Fast Conformer with linearly scalable attention for efficient speech recognition,” in Proc. of ASRU, 2023, pp. 1–8
-
[55]
P. Żelasko, D. Povey, J. Y. Trmal, and S. Khudanpur, “Lhotse: A speech data representation library for the modern deep learning ecosystem,” 2021. [Online]. Available: https://arxiv.org/abs/2110.12561
-
[56]
R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in Proc. of ICASSP, 2018, pp. 351–355
-
[57]
N. Yamashita, S. Horiguchi, and T. Homma, “Improving the naturalness of simulated conversations for end-to-end neural diarization,” in The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 133–140
-
[58]
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210
-
[59]
C. Wang et al., “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li,...
-
[60]
otoearth, “otospeech-full-duplex-processed-141h: Full-duplex conversational speech dataset,” https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-processed-141h, 2026, license: CC BY 4.0
-
[61]
M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using Kaldi,” in Proc. of Interspeech, 2017, pp. 498–502
-
[62]
D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” 2015, arXiv:1510.08484v1
-
[63]
N. Kanda et al., “Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers,” in Proc. of Interspeech, 2020, pp. 36–40
- [64]
-
[65]
F. Yu et al., “M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in Proc. of ICASSP. IEEE, 2022
-
[66]
——, “Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge,” in Proc. of ICASSP. IEEE, 2022
-
[67]
N. Ryant et al., “The third DIHARD diarization challenge,” in Proc. of Interspeech, 2021, pp. 3570–3574
-
[68]
T. Liu et al., “MSDWild: Multi-modal speaker diarization dataset in the wild,” in Proc. of Interspeech, 2022, pp. 1476–1480
-
[69]
S. Horiguchi, N. Tawara, T. Ashihara, A. Ando, and M. Delcroix, “Can we really repurpose multi-speaker ASR corpus for speaker diarization?” in Proc. of ASRU, Dec 2025
-
[70]
J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, “Spot the conversation: speaker diarisation in the wild,” in Proc. of Interspeech, 2020, pp. 299–303
-
[71]
T. v. Neumann, C. B. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “MeetEval: A toolkit for computation of word error rates for meeting transcription systems,” in Proc. of CHiME, 2023, pp. 27–32
-
[72]
O. Kuchaiev et al., “NeMo: a toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019