
arxiv: 2605.15442 · v1 · submitted 2026-05-14 · 📡 eess.AS

Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization

Pith reviewed 2026-05-19 14:34 UTC · model grok-4.3

classification 📡 eess.AS
keywords multi-talker ASR · speaker diarization · synthetic data · conversational speech · FastMSS · speech overlap · source diversity · data mixing

The pith

Training on synthetic conversational data alone approaches real-data baselines, and mixing synthetic with real data yields substantial gains for multi-talker ASR and speaker diarization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how specific choices in generating synthetic conversational audio affect training for multi-talker automatic speech recognition and speaker diarization. It introduces FastMSS, an efficient open-source simulator, to test the effects of speech overlap, source domain diversity, acoustic augmentation, and mixing strategies. Results show that optimal simulation depends on the task, with more overlap helping recognition but hurting diarization, and broad source variety beating exact domain matches. Synthetic-only training nearly reaches real-data performance, but the largest improvements come from combining synthetic and real recordings.

Core claim

Optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Broad source diversity consistently outperforms exact domain matching. Synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.

What carries the argument

FastMSS, a highly efficient open-source simulator for generating synthetic multi-speaker mixtures, used to analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies.
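
As a rough illustration of what a turn-taking simulator of this kind does, the sketch below places single-speaker utterances on a shared timeline, letting each new turn either overlap the previous one or follow a short pause. It is not FastMSS code: the function name `simulate_turns`, the `overlap_prob`, `max_overlap`, and `max_gap` parameters, and the uniform sampling are all assumptions made for illustration, and the paper's actual transition model may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def simulate_turns(durations_by_speaker, n_turns, overlap_prob,
                   max_overlap=1.0, max_gap=1.0, seed=0):
    """Build a timeline of speaker turns (hypothetical helper, not the FastMSS API).

    durations_by_speaker maps a speaker id to a list of utterance durations in
    seconds. Only the timeline is produced; no audio is mixed here.
    """
    rng = random.Random(seed)
    speakers = sorted(durations_by_speaker)
    segments = []
    cursor = 0.0          # latest end time seen so far
    prev_speaker = None
    for _ in range(n_turns):
        # Prefer a speaker change between consecutive turns.
        candidates = [s for s in speakers if s != prev_speaker] or speakers
        speaker = rng.choice(candidates)
        duration = rng.choice(durations_by_speaker[speaker])
        if segments and rng.random() < overlap_prob:
            # Start before the current end of speech, but never before
            # the previous turn started.
            start = max(segments[-1].start,
                        cursor - rng.uniform(0.0, max_overlap))
        else:
            # Start after a pause.
            start = cursor + rng.uniform(0.0, max_gap)
        segments.append(Segment(speaker, start, start + duration))
        cursor = max(cursor, start + duration)
        prev_speaker = speaker
    return segments
```

Raising `overlap_prob` in a sketch like this yields denser overlap, which is the kind of knob the paper reports as helping ASR while hurting diarization.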

If this is right

  • Increasing speech overlap improves multi-talker ASR but degrades speaker diarization performance.
  • Broad source diversity for simulation works better than exact domain matching.
  • Synthetic-only training nearly matches real-data baselines for both tasks.
  • Combining synthetic and real data produces clear gains over real-only training on ASR and diarization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Simulation parameters may need separate tuning for recognition versus diarization systems rather than a single recipe.
  • Wider use of such simulators could reduce dependence on scarce real conversational recordings for system development.
  • The task-dependent findings could guide simulation design for related audio processing problems like noise-robust speech separation.

Load-bearing premise

The simulation choices and acoustic augmentations in FastMSS produce mixtures whose statistical properties are close enough to real conversational recordings that performance trends will transfer.
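
One simple way to probe this premise is to compare summary statistics of simulated and real annotations. The sketch below computes an overlap ratio (time with two or more active segments divided by time with at least one) from speaker segments. The `(speaker, start, end)` tuple layout and the function name are assumptions for illustration; they are not taken from the paper or from FastMSS.

```python
def overlap_ratio(segments):
    """Return overlapped time / total speech time.

    segments: iterable of (speaker, start, end) tuples, times in seconds.
    Note: two simultaneous segments from the same speaker also count as
    overlap in this simplified version.
    """
    events = []
    for _, start, end in segments:
        events.append((start, 1))   # a segment becomes active
        events.append((end, -1))    # a segment ends
    # At equal timestamps, -1 sorts before +1, so segments that merely
    # touch are not counted as overlapping.
    events.sort()
    active = 0
    speech = 0.0
    overlap = 0.0
    prev_t = None
    for t, delta in events:
        if prev_t is not None and active > 0:
            span = t - prev_t
            speech += span
            if active > 1:
                overlap += span
        active += delta
        prev_t = t
    return overlap / speech if speech > 0 else 0.0
```

Computing the same statistic on a real corpus and on simulated mixtures gives a first, coarse check of how far apart their turn-taking behaviour is.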

What would settle it

A test set of real multi-talker recordings on which models trained on the best synthetic mixtures fall well short of models trained only on real data, or on which adding synthetic data fails to improve on real-only training, would call the main claims into question.

Figures

Figures reproduced from arXiv: 2605.15442 by Alexander Polok, Ivan Medennikov, Jan Černocký, Lukáš Burget, Samuele Cornell, Shinji Watanabe.

Figure 1
Figure 1 demonstrates FastMSS’s scalability against two widely used simulators, MMS-MSG [17] and the NeMo multi-speaker data simulator [18], benchmarked on identical hardware (4× AMD EPYC 7742, 256 CPUs, 1 TB RAM) generating 6,000 two-minute meetings from LibriSpeech train-clean-100 without reverberation or noise. FastMSS scales efficiently from a single process up to 32 processes, generating 1,000 hours of annotated multi-…
read the original abstract

Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FastMSS, an efficient open-source simulator for synthetic conversational mixtures, and systematically studies the impact of simulation choices (turn-taking/overlap, source domain diversity, acoustic augmentations, and mixing strategies) on multi-talker ASR using DiCoW and speaker diarization using Sortformer. Key claims are that optimal recipes are task-dependent (overlap helps ASR but hurts diarization), broad source diversity outperforms exact domain matching, synthetic-only training approaches real-data baselines, and mixing synthetic with real data produces substantial gains over real-only training for both tasks.

Significance. If the reported trends hold after controlling for confounds, the work offers practical guidance on synthetic data generation for conversational speech tasks where real recordings are scarce. Strengths include the open-source release of FastMSS and the explicit comparison of task-specific simulation effects; these could help researchers prioritize broad diversity and overlap tuning when augmenting training sets for MT-ASR and diarization systems.

major comments (2)
  1. The central claim that mixing synthetic data with real recordings yields substantial gains (and that synthetic-only approaches real baselines) requires that performance differences arise from the statistical properties of FastMSS mixtures rather than simply increased total training data volume. The manuscript does not describe volume-matched real-only baselines or equivalent augmentations applied to the real data; without such controls the observed gains cannot be unambiguously attributed to the simulation choices.
  2. No error bars, statistical significance tests, or details on the number of experimental runs are reported for the performance trends summarized in the abstract. This makes it difficult to assess whether the claimed 'substantial gains' and task-dependent effects are reliable or could be explained by run-to-run variability.
minor comments (1)
  1. The description of FastMSS could include brief pseudocode or a parameter table to clarify how turn-taking, source selection, and augmentations are implemented, aiding reproducibility even though the code is open-source.
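
The volume control asked for in the first major comment could be approximated by capping every training condition at the same number of hours. A minimal sketch, assuming each recording is described by a hypothetical `(recording_id, duration_seconds)` pair; the helper name and the greedy random selection are illustrative choices, not the authors' procedure:

```python
import random


def subsample_to_hours(recordings, target_hours, seed=0):
    """Randomly pick recordings until an hour budget is (nearly) filled.

    recordings: list of (recording_id, duration_seconds) pairs.
    Returns (chosen_ids, total_hours). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    pool = list(recordings)
    rng.shuffle(pool)
    budget = target_hours * 3600.0
    chosen = []
    total = 0.0
    for rec_id, duration in pool:
        if total + duration > budget:
            # Skip items that would overshoot; a shorter one may still fit.
            continue
        chosen.append(rec_id)
        total += duration
    return chosen, total / 3600.0
```

Applying the same budget to the real-only and the mixed pools would let a comparison separate the effect of data composition from the effect of sheer volume.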

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the work.

read point-by-point responses
  1. Referee: The central claim that mixing synthetic data with real recordings yields substantial gains (and that synthetic-only approaches real baselines) requires that performance differences arise from the statistical properties of FastMSS mixtures rather than simply increased total training data volume. The manuscript does not describe volume-matched real-only baselines or equivalent augmentations applied to the real data; without such controls the observed gains cannot be unambiguously attributed to the simulation choices.

    Authors: We agree that this is an important control to isolate the contribution of FastMSS simulation choices. The current experiments compare synthetic-only, real-only, and mixed conditions using the full available real data volume without explicit volume matching or equivalent real-data augmentations. In the revised manuscript we will add volume-matched real-only baselines (by subsampling or applying comparable augmentations to the real data to equalize total training hours) and report the corresponding results. This will allow us to attribute performance differences more clearly to the statistical properties of the synthetic mixtures. revision: yes

  2. Referee: No error bars, statistical significance tests, or details on the number of experimental runs are reported for the performance trends summarized in the abstract. This makes it difficult to assess whether the claimed 'substantial gains' and task-dependent effects are reliable or could be explained by run-to-run variability.

    Authors: We acknowledge that reporting variability and statistical significance would improve confidence in the reported trends. The experiments presented were performed as single runs per configuration, primarily due to the substantial computational cost of training DiCoW and Sortformer models. In the revision we will rerun the key configurations (synthetic-only, real-only, and mixed) across multiple random seeds, report error bars or standard deviations, and include statistical significance tests for the main claims regarding substantial gains and task-dependent effects. revision: yes
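
The multi-seed reporting promised in the second response might look like the following sketch: a mean and sample standard deviation per configuration, plus an exact paired sign-flip permutation test on per-seed differences. The choice of test and the function names are assumptions for illustration; the rebuttal does not say which test the authors will use, and pairing by seed is only sensible if both configurations share seeds.

```python
import itertools
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over runs; needs >= 2 runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_signflip_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired differences.

    scores_a[i] and scores_b[i] are assumed to come from the same seed.
    Enumerates all 2**n sign patterns, so it is meant for small n only.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            count += 1
    return count / total
```

With n paired runs, the smallest two-sided p-value this test can produce is 2 / 2^n, so three seeds cannot go below 0.25. That bound is worth keeping in mind when deciding how many seeds to run.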

Circularity Check

0 steps flagged

Empirical comparisons independent of fitted parameters or load-bearing self-citations

full rationale

The paper conducts an empirical study comparing synthetic training data generated via the introduced FastMSS simulator against held-out real-data baselines for MT-ASR and speaker diarization. No equations, derivations, or fitted parameters are described that would reduce the reported gains (synthetic-only approaching real baselines, or mixing yielding substantial improvements) to quantities defined by the same inputs. Simulation choices and augmentations are presented as experimental variables whose effects are measured externally, with no self-citation chains or uniqueness theorems invoked to justify core claims. This yields a minor score for possible incidental self-citations on baseline systems but keeps the central results self-contained against real recordings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on the domain assumption that synthetic mixtures generated under the tested conditions are sufficiently representative of real conversational statistics to support the reported performance ordering.

axioms (1)
  • domain assumption: Synthetic mixtures generated by FastMSS and chosen augmentations capture the relevant acoustic and turn-taking statistics of real multi-talker conversations
    Invoked when claiming that synthetic-only training approaches real-data baselines
invented entities (1)
  • FastMSS simulator (no independent evidence)
    purpose: Efficient generation of synthetic multi-speaker mixtures for controlled experimentation
    New tool introduced to enable the parameter sweeps described




Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 3 internal anchors

  1. [1]

    in-the-wild

    Introduction Multi-talker conversational speech processing is undergoing a rapid transformation, driven largely by the shift from highly specialized pipelines to less data-hungry methods built on pretrained foundation models [1–4]. By leveraging massive amounts of single-speaker or self-supervised data, these foundational backbones can be effectively fi...

  2. [2]

    Multi-Talker Speech Processing 2.1. Multi-Talker ASR: DiCoW Multi-talker ASR (MT-ASR) has traditionally been tackled through modular separation-based pipelines [32], end-to-end architectures like Serialized Output Training (SOT) [33], or target-speaker conditioning [34]. The latter paradigm has seen rapid advancement with the introduction of diarization c...

  3. [3]

    Multi-Speaker Conversation Simulation To enable controlled and fast experimentation along the axes described above, we developed FastMSS, an open-source multi-speaker conversation simulator focused on scalable generation with native Lhotse [46] integration. Given a set of single-speaker utterances from a source dataset, FastMSS generates multi-talker mi...

  4. [4]

    Experimental Setup 4.1. Datasets As source domains for synthetic generation, we use: LibriSpeech [49] (read speech, 960h), VoxPopuli [50] (semi-spontaneous parliamentary speech, 543h), otoSpeech [51] (full-duplex conversational speech, 141h), and the close-talk channels of AMI Meeting Corpus [11] and NOTSOFAR-1 (NSF-

  5. [5]

    All datasets were re-aligned using the Montreal Forced Aligner [52] to ensure consistent word-level timestamps

    [12] (spontaneous meetings). All datasets were re-aligned using the Montreal Forced Aligner [52] to ensure consistent word-level timestamps. Noises for data augmentation are taken from the MUSAN [53], with “speech” noises excluded. For DiCoW, we evaluate primarily on AMI Single Distant Microphone (SDM) and NSF-1 [12] Single-Channel (SC), alongside Libri...

  6. [6]

    Impact of Turn-Taking Dynamics In Table 1, we isolate the effect of turn-taking by varying only the simulator transition model parameters while keeping all other factors fixed

    Results 5.1. Impact of Turn-Taking Dynamics In Table 1, we isolate the effect of turn-taking by varying only the simulator transition model parameters while keeping all other factors fixed. For DiCoW, the source utterances are NSF-1 close-talk (∼500h simulated from ∼7.5h); for Sortformer, LibriSpeech (2,000h simulated from 960h), without augmentation. Fo...

  7. [7]

    Conclusions We presented a systematic study of synthetic conversational data for multi-talker speech processing, investigating the impact of turn-taking dynamics, source domain, and data combination strategies on target-speaker ASR (DiCoW) and speaker diarization (Sortformer). Our main findings are fourfold: (i) optimal simulation recipes are task-depen...

  8. [8]

    Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications

    Acknowledgements This work was partially conducted at the 2025 JSALT workshop. Support was provided by the Ministry of Education, Youth and Sports of the Czech Republic (MoE) through the OP JAK project “Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications” (ID:CZ.02.01.01/00/23 020/0008518), and Br...

  9. [9]

    Generative AI Use Disclosure Generative AI tools have only been used to help revise the manuscript

  10. [10]

    Robust speech recognition via large-scale weak supervision,

    A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28 492–28 518

  11. [11]

    WavLM: Large-scale self-supervised pre-training for full stack speech processing,

    S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  12. [12]

    Scaling speech technology to 1,000+ languages,

    V. Pratap et al., “Scaling speech technology to 1,000+ languages,” J. Mach. Learn. Res., vol. 25, no. 1, Jan. 2024

  13. [13]

    Google USM: Scaling automatic speech recognition beyond 100 languages,

    Y. Zhang et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” 2023. [Online]. Available: https://arxiv.org/abs/2303.01037

  14. [14]

    Adapting self-supervised models to multi-talker speech recognition using speaker embeddings,

    Z. Huang, D. Raj, P. García, and S. Khudanpur, “Adapting self-supervised models to multi-talker speech recognition using speaker embeddings,” in Proc. of ICASSP, 2023, pp. 1–5

  15. [15]

    Adapting multi-lingual ASR models for handling multiple talkers,

    C. Li et al., “Adapting multi-lingual ASR models for handling multiple talkers,” in Proc. of Interspeech, 2023, pp. 1314–1318

  16. [16]

    Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization,

    J. Han et al., “Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization,” in Proc. of Interspeech, 2025, pp. 1583–1587

  17. [17]

    Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems,

    T. Park et al., “Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems,” in International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=AyYjRvrbDx

  18. [18]

    Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,

    B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota, “Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024,...

  19. [19]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez et al., “Moshi: a speech-text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037

  20. [20]

    The AMI meeting corpus,

    I. Mccowan et al., “The AMI meeting corpus,” Int’l. Conf. on Methods and Techniques in Behavioral Research, 01 2005

  21. [21]

    NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,

    A. Vinnikov et al., “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proc. of Interspeech, 2024, pp. 5003–5007

  22. [22]

    Summary of the NOTSOFAR-1 challenge: Highlights and learnings,

    I. Abramovski et al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,” Computer Speech & Language, vol. 93, p. 101796, 2025

  23. [23]

    A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,

    T.-B. Nguyen et al., “A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,” 2026. [Online]. Available: https://arxiv.org/abs/2510.23276

  24. [24]

    CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,

    S. Watanabe et al., “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. of CHiME, 2020, pp. 1–7

  25. [25]

    The fifth ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,

    J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Interspeech 2018, 2018, pp. 1561–1565

  26. [26]

    MMS-MSG: A multi-purpose multi-speaker mixture signal generator,

    T. Cord-Landwehr, T. von Neumann, C. Boeddeker, and R. Haeb-Umbach, “MMS-MSG: A multi-purpose multi-speaker mixture signal generator,” in International Workshop on Acoustic Signal Enhancement, 2022, pp. 1–5

  27. [27]

    Property-aware multi-speaker data simulation: A probabilistic modelling technique for synthetic data generation,

    T. J. Park et al., “Property-aware multi-speaker data simulation: A probabilistic modelling technique for synthetic data generation,” in Proc. of CHiME, 2023, pp. 82–86

  28. [28]

    Generating data with text-to-speech and large-language models for conversational speech recognition,

    S. Cornell, J. Darefsky, Z. Duan, and S. Watanabe, “Generating data with text-to-speech and large-language models for conversational speech recognition,” in Proc. SynData4GenAI, 2024, pp. 6–10

  29. [29]

    SDialog: A Python toolkit for end-to-end agent building, user simulation, dialog generation, and evaluation,

    S. Burdisso et al., “SDialog: A Python toolkit for end-to-end agent building, user simulation, dialog generation, and evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.10622

  30. [30]

    Voicebox: text-guided multilingual universal speech generation at scale,

    M. Le et al., “Voicebox: text-guided multilingual universal speech generation at scale,” in International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY, USA: Curran Associates Inc., 2023

  31. [31]

    From simulated mixtures to simulated conversations as training data for end-to-end neural diarization,

    F. Landini, A. Lozano-Diez, M. Diez, and L. Burget, “From simulated mixtures to simulated conversations as training data for end-to-end neural diarization,” in Proc. of Interspeech, 2022, pp. 5095–5099

  32. [32]

    Pushing the limits of end-to-end diarization,

    S. J. Broughton and L. Samarakoon, “Pushing the limits of end-to-end diarization,” in Proc. of Interspeech, 2025, pp. 5218–5222

  33. [33]

    Simulating realistic speech overlaps improves multi-talker ASR,

    M. Yang et al., “Simulating realistic speech overlaps improves multi-talker ASR,” in Proc. of ICASSP, 2023, pp. 1–5

  34. [34]

    Synthetic conversations improve multi-talker ASR,

    T.-B. Nguyen and A. Waibel, “Synthetic conversations improve multi-talker ASR,” in Proc. of ICASSP, 2024, pp. 10 461–10 465

  35. [35]

    Can synthetic speech improve end-to-end conversational speech translation?

    B. Bamfo Odoom et al., “Can synthetic speech improve end-to-end conversational speech translation?” in Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), R. Knowles, A. Eriguchi, and S. Goel, Eds. Chicago, USA: Association for Machine Translation in the Americas, Sep. 2024, pp. 167–177....

  36. [36]

    On the effect of purely synthetic training data for different automatic speech recognition architectures,

    B. Hilmes, N. Rossenbach, and R. Schlüter, “On the effect of purely synthetic training data for different automatic speech recognition architectures,” in Proc. of SynData4GenAI, 2024, pp. 46–50

  37. [37]

    Continuous speech separation: Dataset and analysis,

    Z. Chen et al., “Continuous speech separation: Dataset and analysis,” in Proc. of ICASSP, 2020, pp. 7284–7288

  38. [38]

    LibriMix: An open-source dataset for generalizable speech separation,

    J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” 2020. [Online]. Available: https://arxiv.org/abs/2005.11262

  39. [39]

    End-to-end neural speaker diarization with self-attention,

    Y. Fujita et al., “End-to-end neural speaker diarization with self-attention,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 296–303

  40. [40]

    DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,

    A. Polok et al., “DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,” Computer Speech & Language, vol. 95, p. 101841, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S088523082500066X

  41. [41]

    The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,

    S. Niu et al., “The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,” in Proc. of CHiME, 2024, pp. 31–36

  42. [42]

    Serialized output training for end-to-end overlapped speech recognition,

    N. Kanda, Y. Gaur, X. Wang, Z. Meng, and T. Yoshioka, “Serialized output training for end-to-end overlapped speech recognition,” in Proc. of Interspeech, 2020, pp. 2797–2801

  43. [43]

    Auxiliary interference speaker loss for target-speaker speech recognition,

    N. Kanda et al., “Auxiliary interference speaker loss for target-speaker speech recognition,” in Proc. of Interspeech, 2019, pp. 236–240

  44. [44]

    Target speaker ASR with Whisper,

    A. Polok et al., “Target speaker ASR with Whisper,” in Proc. of ICASSP, 2025, pp. 1–5

  45. [45]

    SE-DiCoW: Self-enrolled diarization-conditioned Whisper,

    ——, “SE-DiCoW: Self-enrolled diarization-conditioned Whisper,” in Proc. of ICASSP, 2026

  46. [46]

    Pyannote.audio: Neural building blocks for speaker diarization,

    H. Bredin et al., “Pyannote.audio: Neural building blocks for speaker diarization,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7124–7128

  47. [47]

    Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,

    K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in Proc. of ICASSP, 2021, pp. 7198–7202

  48. [48]

    End-to-end speaker segmentation for overlap-aware resegmentation,

    H. Bredin and A. Laurent, “End-to-end speaker segmentation for overlap-aware resegmentation,” in Interspeech 2021, 2021, pp. 3111–3115

  49. [49]

    Powerset multi-class cross entropy loss for neural speaker diarization,

    A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. of Interspeech, 2023, pp. 3222–3226

  50. [50]

    Streaming Sortformer: Speaker cache-based online speaker diarization with arrival-time ordering,

    I. Medennikov et al., “Streaming Sortformer: Speaker cache-based online speaker diarization with arrival-time ordering,” in Proc. of Interspeech, 2025, pp. 5238–5242

  51. [51]

    Encoder-decoder based attractors for end-to-end neural diarization,

    S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and P. García, “Encoder-decoder based attractors for end-to-end neural diarization,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, pp. 1493–1507, Mar. 2022. [Online]. Available: https://doi.org/10.1109/TASLP.2022.3162080

  52. [52]

    Online neural diarization of unlimited numbers of speakers using global and local attractors,

    S. Horiguchi, S. Watanabe, P. García, Y. Takashima, and Y. Kawaguchi, “Online neural diarization of unlimited numbers of speakers using global and local attractors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 706–720, 2022

  53. [53]

    NEST: Self-supervised Fast Conformer as all-purpose seasoning to speech processing tasks,

    H. Huang et al., “NEST: Self-supervised Fast Conformer as all-purpose seasoning to speech processing tasks,” in Proc. of ICASSP, 2025, pp. 1–5

  54. [54]

    Fast Conformer with linearly scalable attention for efficient speech recognition,

    D. Rekesh et al., “Fast Conformer with linearly scalable attention for efficient speech recognition,” in Proc. of ASRU, 2023, pp. 1–8

  55. [55]

    Lhotse: A speech data representation library for the modern deep learning ecosystem,

    P. Żelasko, D. Povey, J. Y. Trmal, and S. Khudanpur, “Lhotse: A speech data representation library for the modern deep learning ecosystem,” 2021. [Online]. Available: https://arxiv.org/abs/2110.12561

  56. [56]

    Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,

    R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in Proc. of ICASSP, 2018, pp. 351–355

  57. [57]

    Improving the naturalness of simulated conversations for end-to-end neural diarization,

    N. Yamashita, S. Horiguchi, and T. Homma, “Improving the naturalness of simulated conversations for end-to-end neural diarization,” in The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 133–140

  58. [58]

    Librispeech: An ASR corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210

  59. [59]

    VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,

    C. Wang et al., “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li,...

  60. [60]

    otospeech-full-duplex-processed-141h: Full-duplex conversational speech dataset,

    otoearth, “otospeech-full-duplex-processed-141h: Full-duplex conversational speech dataset,” https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-processed-141h, 2026, license: CC BY 4.0

  61. [61]

    Montreal forced aligner: Trainable text-speech alignment using Kaldi,

    M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using Kaldi,” in Proc. of Interspeech, 2017, pp. 498–502

  62. [62]

    MUSAN: A Music, Speech, and Noise Corpus

    D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” 2015, arXiv:1510.08484v1

  63. [63]

    Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers,

    N. Kanda et al., “Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers,” in Proc. of Interspeech, 2020, pp. 36–40

  64. [64]

    Mixer 6,

    L. Brandschain et al., “Mixer 6,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), N. C. C. Chair) et al., Eds. Valletta, Malta: European Language Resources Association (ELRA), May 2010

  65. [65]

    M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,

    F. Yu et al., “M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in Proc. of ICASSP. IEEE, 2022

  66. [66]

    Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge,

    ——, “Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge,” in Proc. of ICASSP. IEEE, 2022

  67. [67]

    The third DIHARD diarization challenge,

    N. Ryant et al., “The third DIHARD diarization challenge,” in Proc. of Interspeech, 2021, pp. 3570–3574

  68. [68]

    MSDWild: Multi-modal speaker diarization dataset in the wild,

    T. Liu et al., “MSDWild: Multi-modal speaker diarization dataset in the wild,” in Proc. of Interspeech, 2022, pp. 1476–1480

  69. [69]

    Can we really repurpose multi-speaker ASR corpus for speaker diarization?

    S. Horiguchi, N. Tawara, T. Ashihara, A. Ando, and M. Delcroix, “Can we really repurpose multi-speaker ASR corpus for speaker diarization?” in Proc. of ASRU, Dec 2025

  70. [70]

    Spot the conversation: speaker diarisation in the wild,

    J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, “Spot the conversation: speaker diarisation in the wild,” in Proc. of Interspeech, 2020, pp. 299–303

  71. [71]

    MeetEval: A toolkit for computation of word error rates for meeting transcription systems,

    T. v. Neumann, C. B. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “MeetEval: A toolkit for computation of word error rates for meeting transcription systems,” in Proc. of CHiME, 2023, pp. 27–32

  72. [72]

    NeMo: a toolkit for building AI applications using neural modules,

    O. Kuchaiev et al., “NeMo: a toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019