Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization

arxiv: 2605.15442 · v1 · submitted 2026-05-14 · 📡 eess.AS

Alexander Polok, Ivan Medennikov, Jan Černocký, Shinji Watanabe, Lukáš Burget, Samuele Cornell

Pith reviewed 2026-05-19 14:34 UTC · model grok-4.3

classification 📡 eess.AS

keywords: multi-talker ASR, speaker diarization, synthetic data, conversational speech, FastMSS, speech overlap, source diversity, data mixing

The pith

Synthetic conversational data approaches real-data baselines and mixing both yields substantial gains for multi-talker ASR and speaker diarization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how specific choices in generating synthetic conversational audio affect training for multi-talker automatic speech recognition and speaker diarization. It introduces FastMSS, an efficient open-source simulator, to test the effects of speech overlap, source domain diversity, acoustic augmentation, and mixing strategies. Results show that optimal simulation depends on the task, with more overlap helping recognition but hurting diarization, and broad source variety beating exact domain matches. Synthetic-only training nearly reaches real-data performance, but the largest improvements come from combining synthetic and real recordings.

Core claim

Optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Broad source diversity consistently outperforms exact domain matching. Synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.

What carries the argument

FastMSS, a highly efficient open-source simulator for generating synthetic multi-speaker mixtures, used to analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies.

If this is right

Increasing speech overlap improves multi-talker ASR but degrades speaker diarization performance.
Broad source diversity for simulation works better than exact domain matching.
Synthetic-only training nearly matches real-data baselines for both tasks.
Combining synthetic and real data produces clear gains over real-only training on ASR and diarization.
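
Since the first point turns on how much overlap a mixture contains, it helps to fix one way of measuring it. The sketch below is an illustration only, not the paper's metric: it assumes segments arrive as `(speaker, start, end)` tuples in seconds, that a speaker's own segments do not overlap each other, and it defines the overlap ratio as overlapped time divided by total speech time. The function name `overlap_ratio` is hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (speaker, start_s, end_s); assumed input format


def overlap_ratio(segments: List[Segment]) -> float:
    """Share of speech time in which two or more speakers are active."""
    events = []
    for _, start, end in segments:
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker becomes inactive
    # At equal timestamps, ends sort before starts, so touching segments
    # are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    active = 0
    prev_t = 0.0
    speech = 0.0
    overlap = 0.0
    for t, delta in events:
        dur = t - prev_t
        if active >= 1:
            speech += dur
        if active >= 2:
            overlap += dur
        active += delta
        prev_t = t
    return overlap / speech if speech > 0 else 0.0


if __name__ == "__main__":
    demo = [("A", 0.0, 10.0), ("B", 8.0, 12.0)]
    print(f"overlap ratio: {overlap_ratio(demo):.3f}")
```

Under this definition, a simulator setting that raises the ratio would be expected to help ASR and hurt diarization if the paper's trend holds; other definitions (for example, overlap relative to total recording length) would give different numbers.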

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Simulation parameters may need separate tuning for recognition versus diarization systems rather than a single recipe.
Wider use of such simulators could reduce dependence on scarce real conversational recordings for system development.
The task-dependent findings could guide simulation design for related audio processing problems like noise-robust speech separation.

Load-bearing premise

The simulation choices and acoustic augmentations in FastMSS produce mixtures whose statistical properties are close enough to real conversational recordings that performance trends will transfer.

What would settle it

A test set of real multi-talker recordings where models trained on the best synthetic mixtures underperform models trained only on real data would call the main claims into question.

Figures

Figures reproduced from arXiv: 2605.15442 by Alexander Polok, Ivan Medennikov, Jan Černocký, Lukáš Burget, Samuele Cornell, Shinji Watanabe.

**Figure 1.** FastMSS’s scalability against two widely used simulators, MMS-MSG [17] and the NeMo multi-speaker data simulator [18], benchmarked on identical hardware (4× AMD EPYC 7742, 256 CPUs, 1 TB RAM) generating 6,000 two-minute meetings from LibriSpeech train-clean-100 without reverberation or noise. FastMSS scales efficiently from a single process up to 32 processes, generating 1,000 hours of annotated multi-…

Original abstract

Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Synthetic data gets close to real baselines and mixing helps, but gains may trace to data volume rather than the claimed simulation properties.

The main takeaway is that synthetic-only training approaches real-data baselines for multi-talker ASR and diarization, while adding simulated mixtures to real recordings produces further gains. The paper also reports that simulation parameters affect the two tasks differently: more overlap helps ASR but hurts diarization, and broad source diversity beats exact domain matching. The authors introduce FastMSS, an efficient simulator, to run these comparisons on DiCoW and Sortformer. These observations give practical pointers for anyone generating training data when real conversational recordings are scarce.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FastMSS, an efficient open-source simulator for synthetic conversational mixtures, and systematically studies the impact of simulation choices (turn-taking/overlap, source domain diversity, acoustic augmentations, and mixing strategies) on multi-talker ASR using DiCoW and speaker diarization using Sortformer. Key claims are that optimal recipes are task-dependent (overlap helps ASR but hurts diarization), broad source diversity outperforms exact domain matching, synthetic-only training approaches real-data baselines, and mixing synthetic with real data produces substantial gains over real-only training for both tasks.

Significance. If the reported trends hold after controlling for confounds, the work offers practical guidance on synthetic data generation for conversational speech tasks where real recordings are scarce. Strengths include the open-source release of FastMSS and the explicit comparison of task-specific simulation effects; these could help researchers prioritize broad diversity and overlap tuning when augmenting training sets for MT-ASR and diarization systems.

major comments (2)

The central claim that mixing synthetic data with real recordings yields substantial gains (and that synthetic-only approaches real baselines) requires that performance differences arise from the statistical properties of FastMSS mixtures rather than simply increased total training data volume. The manuscript does not describe volume-matched real-only baselines or equivalent augmentations applied to the real data; without such controls the observed gains cannot be unambiguously attributed to the simulation choices.
No error bars, statistical significance tests, or details on the number of experimental runs are reported for the performance trends summarized in the abstract. This makes it difficult to assess whether the claimed 'substantial gains' and task-dependent effects are reliable or could be explained by run-to-run variability.
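
The volume control requested in the first major comment can be made concrete. The sketch below is one possible reading, not the authors' protocol: it assumes each training pool is a list of `(recording_id, duration_hours)` pairs and builds a mixed set whose total duration does not exceed a fixed budget, so that a real-only run and a mixed run see roughly the same number of hours. The names `subsample_to_hours` and `build_volume_matched_mix` are hypothetical.

```python
import random
from typing import List, Tuple

Recording = Tuple[str, float]  # (recording_id, duration_hours); assumed manifest format


def subsample_to_hours(pool: List[Recording], target_hours: float, seed: int = 0) -> List[Recording]:
    """Randomly pick recordings without exceeding target_hours in total."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)
    picked, total = [], 0.0
    for rec_id, hours in shuffled:
        if total + hours <= target_hours:
            picked.append((rec_id, hours))
            total += hours
    return picked


def build_volume_matched_mix(real: List[Recording], synth: List[Recording],
                             total_hours: float, synth_fraction: float,
                             seed: int = 0) -> List[Recording]:
    """Mix real and synthetic data under one shared hour budget."""
    real_part = subsample_to_hours(real, total_hours * (1.0 - synth_fraction), seed)
    synth_part = subsample_to_hours(synth, total_hours * synth_fraction, seed + 1)
    return real_part + synth_part


if __name__ == "__main__":
    real = [(f"real_{i}", 1.0) for i in range(100)]
    synth = [(f"synth_{i}", 1.0) for i in range(200)]
    mix = build_volume_matched_mix(real, synth, total_hours=100.0, synth_fraction=0.5)
    print(len(mix), sum(hours for _, hours in mix))
```

Comparing a model trained on `mix` against one trained on the full 100 hours of `real` would hold volume fixed, so any remaining gap could be attributed more directly to the properties of the synthetic data.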

minor comments (1)

The description of FastMSS could include a brief pseudocode or parameter table to clarify how turn-taking, source selection, and augmentations are implemented, aiding reproducibility even though the code is open-source.
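
To show the kind of pseudocode this comment has in mind, here is a minimal two-speaker sketch. It is not FastMSS code and does not use its API. The only details taken from the paper are that there are four utterance transition types and that the overlap extent is drawn as a ratio from a truncated exponential; the transition labels, the default `rate` and `gap` values, and the rule for placing overlapping utterances are all assumptions made for illustration.

```python
import math
import random

# Hypothetical labels for the four transition types; the paper's names may differ.
TRANSITIONS = ("turn_hold", "turn_switch", "interruption", "backchannel")


def sample_truncated_exponential(rng, rate, upper=1.0):
    """Inverse-CDF sample from an exponential truncated to [0, upper]."""
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-rate * upper))) / rate


def simulate_two_speaker_turns(durations, weights, rate=5.0, gap=0.3, seed=0):
    """Place utterances on a timeline; returns (speaker, start_s, end_s) tuples."""
    rng = random.Random(seed)
    segments = []
    last_end = [0.0, 0.0]            # end time of each speaker's latest utterance
    floor = 0                        # speaker currently holding the floor
    prev_start, prev_end = 0.0, 0.0  # span of the floor holder's latest utterance
    for i, dur in enumerate(durations):
        if i == 0:
            kind, spk, start = "turn_hold", floor, 0.0
        else:
            kind = rng.choices(TRANSITIONS, weights=weights)[0]
            if kind == "turn_hold":
                spk, start = floor, prev_end + gap
            elif kind == "turn_switch":
                spk, start = 1 - floor, prev_end + gap
            else:
                # Interruption or backchannel: the other speaker starts before
                # the floor holder finishes, overlapping by a sampled ratio.
                ratio = sample_truncated_exponential(rng, rate)
                overlap = ratio * min(dur, prev_end - prev_start)
                spk, start = 1 - floor, prev_end - overlap
            # Never let a speaker overlap with their own previous utterance.
            start = max(start, last_end[spk])
        end = start + dur
        segments.append((spk, start, end))
        last_end[spk] = end
        if kind != "backchannel":    # a backchannel does not take the floor
            floor, prev_start, prev_end = spk, start, end
    return segments


if __name__ == "__main__":
    for seg in simulate_two_speaker_turns([2.0, 3.0, 1.5, 2.5], weights=(4, 3, 2, 1)):
        print(seg)
```

In this sketch, raising the weights on the two overlapping transitions or lowering `rate` (which shifts sampled ratios toward 1) increases overlap. These are the sort of knobs a parameter table in the paper could document.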

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the work.

Referee: The central claim that mixing synthetic data with real recordings yields substantial gains (and that synthetic-only approaches real baselines) requires that performance differences arise from the statistical properties of FastMSS mixtures rather than simply increased total training data volume. The manuscript does not describe volume-matched real-only baselines or equivalent augmentations applied to the real data; without such controls the observed gains cannot be unambiguously attributed to the simulation choices.

Authors: We agree that this is an important control to isolate the contribution of FastMSS simulation choices. The current experiments compare synthetic-only, real-only, and mixed conditions using the full available real data volume without explicit volume matching or equivalent real-data augmentations. In the revised manuscript we will add volume-matched real-only baselines (by subsampling or applying comparable augmentations to the real data to equalize total training hours) and report the corresponding results. This will allow us to attribute performance differences more clearly to the statistical properties of the synthetic mixtures. revision: yes
Referee: No error bars, statistical significance tests, or details on the number of experimental runs are reported for the performance trends summarized in the abstract. This makes it difficult to assess whether the claimed 'substantial gains' and task-dependent effects are reliable or could be explained by run-to-run variability.

Authors: We acknowledge that reporting variability and statistical significance would improve confidence in the reported trends. The experiments presented were performed as single runs per configuration, primarily due to the substantial computational cost of training DiCoW and Sortformer models. In the revision we will rerun the key configurations (synthetic-only, real-only, and mixed) across multiple random seeds, report error bars or standard deviations, and include statistical significance tests for the main claims regarding substantial gains and task-dependent effects. revision: yes
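
The variability reporting promised in the second response could take several forms. The sketch below shows two common ones under stated assumptions: a mean and sample standard deviation over per-seed scores, and a paired bootstrap over per-recording error counts. It assumes lower error is better and that both systems were scored on the same recordings in the same order. The function names are hypothetical, and nothing here reflects the tests the authors will actually run.

```python
import random
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap_pvalue(errors_a, errors_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system B is not better than system A.

    errors_a and errors_b hold per-recording error counts for the same
    recordings in the same order (an assumption of this sketch).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample recordings with replacement and sum the error differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


if __name__ == "__main__":
    print(mean_and_std([10.0, 12.0, 14.0]))
    real_only = [12, 9, 15, 11, 14, 10, 13, 12]
    mixed = [10, 9, 12, 11, 11, 9, 12, 10]
    print(paired_bootstrap_pvalue(real_only, mixed))
```

A small returned fraction suggests the improvement of B over A is unlikely to be a resampling artifact. It does not account for seed-to-seed training variance, which is why both pieces are shown together.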

Circularity Check

0 steps flagged

Empirical comparisons independent of fitted parameters or load-bearing self-citations

The paper conducts an empirical study comparing synthetic training data generated via the introduced FastMSS simulator against held-out real-data baselines for MT-ASR and speaker diarization. No equations, derivations, or fitted parameters are described that would reduce the reported gains (synthetic-only approaching real baselines, or mixing yielding substantial improvements) to quantities defined by the same inputs. Simulation choices and augmentations are presented as experimental variables whose effects are measured externally, with no self-citation chains or uniqueness theorems invoked to justify core claims. This yields a minor score for possible incidental self-citations on baseline systems but keeps the central results self-contained against real recordings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on the domain assumption that synthetic mixtures generated under the tested conditions are sufficiently representative of real conversational statistics to support the reported performance ordering.

axioms (1)

domain assumption: Synthetic mixtures generated by FastMSS and the chosen augmentations capture the relevant acoustic and turn-taking statistics of real multi-talker conversations.
Invoked when claiming that synthetic-only training approaches real-data baselines.

invented entities (1)

FastMSS simulator (no independent evidence)
purpose: Efficient generation of synthetic multi-speaker mixtures for controlled experimentation
New tool introduced to enable the parameter sweeps described

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

FastMSS Turn-Taking (TT) model simply extends the two-speaker HMM-based approach... Four utterance transition types... overlap extent (IR) is drawn as a ratio from a truncated exponential
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

optimal simulation recipes are task-dependent: boosting overlap improves MT-ASR but degrades diarization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 3 internal anchors

[1]

Introduction. Multi-talker conversational speech processing is undergoing a rapid transformation, driven largely by the shift from highly specialized pipelines to less data-hungry methods built on pretrained foundation models [1–4]. By leveraging massive amounts of single-speaker or self-supervised data, these foundational backbones can be effectively fi...

[2]

Multi-Talker Speech Processing. 2.1. Multi-Talker ASR: DiCoW. Multi-talker ASR (MT-ASR) has traditionally been tackled through modular separation-based pipelines [32], end-to-end architectures like Serialized Output Training (SOT) [33], or target-speaker conditioning [34]. The latter paradigm has seen rapid advancement with the introduction of diarization c...

[3]

Multi-Speaker Conversation Simulation. To enable controlled and fast experimentation along the axes described above, we developed FastMSS, an open-source multi-speaker conversation simulator focused on scalable generation with native Lhotse [46] integration. Given a set of single-speaker utterances from a source dataset, FastMSS generates multi-talker mi...

[4]

Experimental Setup. 4.1. Datasets. As source domains for synthetic generation, we use: LibriSpeech [49] (read speech, 960h), VoxPopuli [50] (semi-spontaneous parliamentary speech, 543h), otoSpeech [51] (full-duplex conversational speech, 141h), and the close-talk channels of AMI Meeting Corpus [11] and NOTSOFAR-1 (NSF-

[5]

[12] (spontaneous meetings). All datasets were re-aligned using the Montreal Forced Aligner [52] to ensure consistent word-level timestamps. Noises for data augmentation are taken from the MUSAN [53], with “speech” noises excluded. For DiCoW, we evaluate primarily on AMI Single Distant Microphone (SDM) and NSF-1 [12] Single-Channel (SC), alongside Libri...

[6]

Results. 5.1. Impact of Turn-Taking Dynamics. In Table 1, we isolate the effect of turn-taking by varying only the simulator transition model parameters while keeping all other factors fixed. For DiCoW, the source utterances are NSF-1 close-talk (∼500h simulated from ∼7.5h); for Sortformer, LibriSpeech (2,000h simulated from 960h), without augmentation. Fo...

[7]

Conclusions. We presented a systematic study of synthetic conversational data for multi-talker speech processing, investigating the impact of turn-taking dynamics, source domain, and data combination strategies on target-speaker ASR (DiCoW) and speaker diarization (Sortformer). Our main findings are fourfold: (i) optimal simulation recipes are task-depen...

[8]

Acknowledgements. This work was partially conducted at the 2025 JSALT workshop. Support was provided by the Ministry of Education, Youth and Sports of the Czech Republic (MoE) through the OP JAK project “Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications” (ID:CZ.02.01.01/00/23 020/0008518), and Br...

[9]

Generative AI Use Disclosure. Generative AI tools have only been used to help revise the manuscript

[10]

A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518

[11]

S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

[12]

V. Pratap et al., “Scaling speech technology to 1,000+ languages,” J. Mach. Learn. Res., vol. 25, no. 1, Jan. 2024

[13]

Y. Zhang et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” 2023. [Online]. Available: https://arxiv.org/abs/2303.01037

[14]

Z. Huang, D. Raj, P. García, and S. Khudanpur, “Adapting self-supervised models to multi-talker speech recognition using speaker embeddings,” in Proc. of ICASSP, 2023, pp. 1–5

[15]

C. Li et al., “Adapting multi-lingual ASR models for handling multiple talkers,” in Proc. of Interspeech, 2023, pp. 1314–1318

[16]

J. Han et al., “Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization,” in Proc. of Interspeech, 2025, pp. 1583–1587

[17]

T. Park et al., “Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems,” in International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=AyYjRvrbDx

[18]

B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota, “Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024,...

[19]

A. Défossez et al., “Moshi: a speech-text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037

[20]

I. McCowan et al., “The AMI meeting corpus,” Int’l. Conf. on Methods and Techniques in Behavioral Research, Jan. 2005

[21]

A. Vinnikov et al., “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proc. of Interspeech, 2024, pp. 5003–5007

[22]

I. Abramovski et al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,” Computer Speech & Language, vol. 93, p. 101796, 2025

[23]

T.-B. Nguyen et al., “A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,” 2026. [Online]. Available: https://arxiv.org/abs/2510.23276

[24]

S. Watanabe et al., “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. of CHiME, 2020, pp. 1–7

[25]

J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Interspeech 2018, 2018, pp. 1561–1565

[26]

T. Cord-Landwehr, T. von Neumann, C. Boeddeker, and R. Haeb-Umbach, “MMS-MSG: A multi-purpose multi-speaker mixture signal generator,” in International Workshop on Acoustic Signal Enhancement, 2022, pp. 1–5

[27]

T. J. Park et al., “Property-aware multi-speaker data simulation: A probabilistic modelling technique for synthetic data generation,” in Proc. of CHiME, 2023, pp. 82–86

[28]

S. Cornell, J. Darefsky, Z. Duan, and S. Watanabe, “Generating data with text-to-speech and large-language models for conversational speech recognition,” in Proc. SynData4GenAI, 2024, pp. 6–10

[29]

S. Burdisso et al., “SDialog: A Python toolkit for end-to-end agent building, user simulation, dialog generation, and evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.10622

[30]

M. Le et al., “Voicebox: text-guided multilingual universal speech generation at scale,” in International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY, USA: Curran Associates Inc., 2023

[31]

F. Landini, A. Lozano-Diez, M. Diez, and L. Burget, “From simulated mixtures to simulated conversations as training data for end-to-end neural diarization,” in Proc. of Interspeech, 2022, pp. 5095–5099

[32]

S. J. Broughton and L. Samarakoon, “Pushing the limits of end-to-end diarization,” in Proc. of Interspeech, 2025, pp. 5218–5222

[33]

M. Yang et al., “Simulating realistic speech overlaps improves multi-talker ASR,” in Proc. of ICASSP, 2023, pp. 1–5

[34]

T.-B. Nguyen and A. Waibel, “Synthetic conversations improve multi-talker ASR,” in Proc. of ICASSP, 2024, pp. 10461–10465

[35]

B. Bamfo Odoom et al., “Can synthetic speech improve end-to-end conversational speech translation?” in Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), R. Knowles, A. Eriguchi, and S. Goel, Eds. Chicago, USA: Association for Machine Translation in the Americas, Sep. 2024, pp. 167–177....

[36]

B. Hilmes, N. Rossenbach, and R. Schlüter, “On the effect of purely synthetic training data for different automatic speech recognition architectures,” in Proc. of SynData4GenAI, 2024, pp. 46–50

[37]

Z. Chen et al., “Continuous speech separation: Dataset and analysis,” in Proc. of ICASSP, 2020, pp. 7284–7288

[38]

J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” 2020. [Online]. Available: https://arxiv.org/abs/2005.11262

[39]

Y. Fujita et al., “End-to-end neural speaker diarization with self-attention,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 296–303

[40]

A. Polok et al., “DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,” Computer Speech & Language, vol. 95, p. 101841, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S088523082500066X

[41]

S. Niu et al., “The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,” in Proc. of CHiME, 2024, pp. 31–36

[42]

N. Kanda, Y. Gaur, X. Wang, Z. Meng, and T. Yoshioka, “Serialized output training for end-to-end overlapped speech recognition,” in Proc. of Interspeech, 2020, pp. 2797–2801

[43]

N. Kanda et al., “Auxiliary interference speaker loss for target-speaker speech recognition,” in Proc. of Interspeech, 2019, pp. 236–240

[44]

A. Polok et al., “Target speaker ASR with Whisper,” in Proc. of ICASSP, 2025, pp. 1–5

[45]

A. Polok et al., “SE-DiCoW: Self-enrolled diarization-conditioned Whisper,” in Proc. of ICASSP, 2026

[46]

H. Bredin et al., “Pyannote.audio: Neural building blocks for speaker diarization,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7124–7128

[47]

K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in Proc. of ICASSP, 2021, pp. 7198–7202

[48]

H. Bredin and A. Laurent, “End-to-end speaker segmentation for overlap-aware resegmentation,” in Interspeech 2021, 2021, pp. 3111–3115

[49]

A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. of Interspeech, 2023, pp. 3222–3226

[50]

I. Medennikov et al., “Streaming Sortformer: Speaker cache-based online speaker diarization with arrival-time ordering,” in Proc. of Interspeech, 2025, pp. 5238–5242

[51]

S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and P. García, “Encoder-decoder based attractors for end-to-end neural diarization,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, pp. 1493–1507, Mar. 2022. [Online]. Available: https://doi.org/10.1109/TASLP.2022.3162080

[52]

S. Horiguchi, S. Watanabe, P. García, Y. Takashima, and Y. Kawaguchi, “Online neural diarization of unlimited numbers of speakers using global and local attractors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 706–720, 2022

[53]

H. Huang et al., “NEST: Self-supervised Fast Conformer as all-purpose seasoning to speech processing tasks,” in Proc. of ICASSP, 2025, pp. 1–5

[54]

D. Rekesh et al., “Fast Conformer with linearly scalable attention for efficient speech recognition,” in Proc. of ASRU, 2023, pp. 1–8

[55]

P. Żelasko, D. Povey, J. Y. Trmal, and S. Khudanpur, “Lhotse: A speech data representation library for the modern deep learning ecosystem,” 2021. [Online]. Available: https://arxiv.org/abs/2110.12561

[56]

R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in Proc. of ICASSP, 2018, pp. 351–355

[57]

N. Yamashita, S. Horiguchi, and T. Homma, “Improving the naturalness of simulated conversations for end-to-end neural diarization,” in The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 133–140

[58]

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015, pp. 5206–5210

[59]

C. Wang et al., “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li,...

[60]

otospeech-full-duplex-processed-141h: Full-duplex conversational speech dataset,

otoearth, “otospeech-full-duplex-processed-141h: Full-duplex conversational speech dataset,” https://huggingface.co/datasets/ otoearth/otoSpeech-full-duplex-processed-141h, 2026, license: CC BY 4.0

[61]

Montreal forced aligner: Trainable text-speech alignment using Kaldi,

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using Kaldi,” in Proc. of Interspeech, 2017, pp. 498–502

[62]

MUSAN: A Music, Speech, and Noise Corpus

D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” 2015, arXiv:1510.08484v1

[63]

Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers,

N. Kanda et al., “Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers,” in Proc. of Interspeech, 2020, pp. 36–40

[64]

Mixer 6,

L. Brandschain et al., “Mixer 6,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), N. C. C. Chair) et al., Eds. Valletta, Malta: European Language Resources Association (ELRA), May 2010

[65]

M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,

F. Yu et al., “M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in Proc. of ICASSP. IEEE, 2022

[66]

Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge,

F. Yu et al., “Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge,” in Proc. of ICASSP. IEEE, 2022

[67]

The third DIHARD diarization challenge,

N. Ryant et al., “The third DIHARD diarization challenge,” in Proc. of Interspeech, 2021, pp. 3570–3574

[68]

MSDWild: Multi-modal speaker diarization dataset in the wild,

T. Liu et al., “MSDWild: Multi-modal speaker diarization dataset in the wild,” in Proc. of Interspeech, 2022, pp. 1476–1480

[69]

Can we really repurpose multi-speaker ASR corpus for speaker diarization?

S. Horiguchi, N. Tawara, T. Ashihara, A. Ando, and M. Delcroix, “Can we really repurpose multi-speaker ASR corpus for speaker diarization?” in Proc. of ASRU, Dec. 2025

[70]

Spot the conversation: speaker diarisation in the wild,

J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, “Spot the conversation: speaker diarisation in the wild,” in Proc. of Interspeech, 2020, pp. 299–303

[71]

MeetEval: A toolkit for computation of word error rates for meeting transcription systems,

T. v. Neumann, C. B. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “MeetEval: A toolkit for computation of word error rates for meeting transcription systems,” in Proc. of CHiME, 2023, pp. 27–32

[72]

NeMo: a toolkit for building AI applications using neural modules,

O. Kuchaiev et al., “NeMo: a toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019

