Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

Chyi-Jiunn Lin; Muhammad Shakeel; Shinji Watanabe; Yifan Peng; Yui Sudo

arXiv: 2508.20474 · v2 · pith:N34ADC4Cnew · submitted 2025-08-28 · 📡 eess.AS · cs.CL · cs.SD

Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe

Pith reviewed 2026-05-18 21:16 UTC · model grok-4.3


keywords multi-speaker encoder · speaker diarization · speech separation · multi-speaker ASR · joint training · overlapping speech · LibriMix


The pith

A unified multi-speaker encoder jointly trained on diarization, separation, and recognition outperforms single-task models on overlapping speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a shared speech encoder that learns representations usable for three related tasks at once: determining who speaks when in a recording, separating mixed voices, and transcribing what multiple speakers say. Representations from several layers of this encoder are combined through a residual weighted sum to pull in information at different levels of abstraction. Joint training lets the model exploit how these tasks depend on one another rather than solving them in isolation. Results on LibriMix show lower error rates than models trained separately for each task, with the largest reported gains in speaker diarization.

Core claim

The unified multi-speaker encoder jointly learns representations for speaker diarization, speech separation, and multi-speaker automatic speech recognition using a shared speech foundational encoder. Hidden representations from multiple layers are combined as a residual weighted-sum encoding to align information across semantic levels and capture interdependencies among the tasks, leading to improved performance on overlapping speech data.

What carries the argument

Unified multi-speaker encoder with residual weighted-sum encoding from multiple layers, which supplies bottom-up alignment across semantic levels for the three tasks.
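
As a rough illustration of the layer-combination idea, the sketch below computes a softmax-weighted sum over per-layer hidden states and adds it to the final layer's output. It is a sketch under stated assumptions, not the paper's formulation: the softmax normalization, the choice of the last layer as the residual branch, and the names `rwse`, `hidden_states`, and `layer_logits` are all hypothetical, since this page does not give the exact RWSE equation.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of per-layer scalars.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rwse(hidden_states, layer_logits):
    """Hypothetical residual weighted-sum encoding.

    hidden_states: list of layers, each a list of frames, each a list of
                   floats (all layers share the same frames x dim shape).
    layer_logits:  one learnable scalar per layer (assumed parameterization).
    Returns a frames x dim nested list.
    """
    weights = softmax(layer_logits)
    last = hidden_states[-1]
    output = []
    for t, frame in enumerate(last):
        row = []
        for d, residual in enumerate(frame):
            # Weighted sum of the same (frame, dim) entry across all layers.
            mixed = sum(w * layer[t][d] for w, layer in zip(weights, hidden_states))
            # Assumed residual path: add the final layer's own output.
            row.append(residual + mixed)
        output.append(row)
    return output
```

With all logits at zero the weights are uniform, so the weighted term reduces to the plain average of the layers. In training, the logits would be updated together with the encoder and the task heads.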

Load-bearing premise

Joint training on the three tasks will create useful synergies without harmful interference between them, and weighting representations from multiple encoder layers will align the tasks effectively.

What would settle it

Evaluation on a dataset containing four or more simultaneous speakers, or on real meeting recordings with background noise, checking whether diarization error rates rise above the reported 1.37% (Libri2Mix) and 2.29% (Libri3Mix) figures.
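
For readers unfamiliar with the metric, diarization error rate is conventionally the sum of missed speech, false alarm, and speaker confusion, divided by total reference speech. The sketch below is a simplified frame-level version. It assumes binary activity matrices with the same number of frames and speaker columns and applies no forgiveness collar; the function name `frame_der` is hypothetical, and this is not claimed to be the scoring setup used in the paper.

```python
from itertools import permutations


def frame_der(ref, hyp):
    """Frame-level DER for binary activity matrices (frames x speakers).

    Assumes ref and hyp have the same number of frames and speaker columns,
    and that ref contains at least one active speaker frame.
    """
    n_ref = [sum(frame) for frame in ref]
    n_hyp = [sum(frame) for frame in hyp]
    total = sum(n_ref)
    miss = sum(max(r - h, 0) for r, h in zip(n_ref, n_hyp))
    false_alarm = sum(max(h - r, 0) for r, h in zip(n_ref, n_hyp))
    num_speakers = len(ref[0])
    best_confusion = None
    # Hypothesis labels are arbitrary, so score under the best speaker mapping.
    for perm in permutations(range(num_speakers)):
        confusion = 0
        for r_frame, h_frame, r, h in zip(ref, hyp, n_ref, n_hyp):
            correct = sum(
                r_frame[s] and h_frame[perm[s]] for s in range(num_speakers)
            )
            confusion += min(r, h) - correct
        if best_confusion is None or confusion < best_confusion:
            best_confusion = confusion
    return (miss + false_alarm + best_confusion) / total
```

The exhaustive search over speaker permutations is only practical for small speaker counts such as the two- and three-speaker mixtures discussed here; a four-or-more-speaker test would normally use an assignment solver instead.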

Figures

Figures reproduced from arXiv: 2508.20474 by Chyi-Jiunn Lin, Muhammad Shakeel, Shinji Watanabe, Yifan Peng, Yui Sudo.

**Figure 1.** The overall framework of UME. It leverages the hidden representations through an RWSE of intermediate layers, which act as a bridge between SD, SS, and multi-speaker ASR tasks. This enables a comprehensive and detailed interaction from each layer of the SFM encoder. Note that our goal is not to develop a new encoder or speech processing tasks; in principle, one can apply any SFM encoder, SD, SS, or m…

**Figure 2.** Separation results of two-speaker mixtures. (a) Input …

**Figure 3.** Separation results of three-speaker mixtures. (a) Input speech mixture of three speakers and WHAM! noise (speaker1, …

Original abstract

This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Shared multi-speaker encoder delivers competitive LibriMix numbers but the benefits of joint training over single-task models still need ablations to confirm.


The main point to take away is that a shared foundational encoder with residual weighted-sum features from multiple layers can deliver lower error rates across speaker diarization, speech separation, and multi-speaker ASR on LibriMix sets. The reported 1.37% DER on Libri2Mix and 2.29% on Libri3Mix stand out as better than prior single-task work.

What the paper actually does is train one encoder jointly on the three tasks and use the RWSE to combine layer outputs for better bottom-up alignment. This builds on multi-task learning ideas but applies them specifically to these overlapping-speech problems with a concrete architecture. The gains over dedicated baselines are the concrete contribution here, and using a speech foundational encoder as the base keeps the approach practical rather than starting from random weights.

The soft spot is the lack of detail on how the joint optimization avoids negative transfer. The abstract mentions capturing interdependencies but does not include loss weight schedules, gradient diagnostics, or ablations that would show whether one task crowds out the others or if the RWSE weights actually distribute across layers. If the improvements hold only under particular hyperparameter choices, that limits how much we can credit the unification itself rather than just the model capacity.

This paper is for speech researchers who work on practical pipelines for multi-speaker audio and want to see if one model can replace three separate ones. A reader looking for new theoretical insights or formal proofs will not find much, but someone evaluating system-level simplifications might get value from the benchmark results on standard LibriMix data.

The math and data look standard for the field, with held-out evaluations and comparisons to previous studies. The citation pattern seems to reference relevant prior work on separate models. I would send this to peer review because the empirical claims are testable and the architecture is described enough to reproduce the setup, even if more controls would strengthen the case for the joint training approach.

Referee Report

1 major / 2 minor

Summary. The paper introduces a Unified Multi-speaker Encoder (UME) that jointly trains a shared foundational speech encoder for speaker diarization (SD), speech separation (SS), and multi-speaker ASR. It uses residual weighted-sum encoding (RWSE) from multiple encoder layers to align semantic levels across tasks and reports empirical gains over single-task baselines on LibriMix evaluation sets, including diarization error rates of 1.37% on Libri2Mix and 2.29% on Libri3Mix.

Significance. If the gains prove robust, the work could advance efficient multi-task speech processing by exploiting inter-task synergies on overlapping speech. The held-out evaluation on standard LibriMix benchmarks is a positive element of the empirical assessment.

major comments (1)

[§4] (Experiments) The central claim that joint multi-task training with RWSE produces beneficial synergies rests on comparisons to dedicated single-task baselines, yet the manuscript provides no ablations that control for model capacity, training schedule, or loss-weighting effects, nor any diagnostics for gradient conflicts or negative transfer. This leaves open whether the reported 1.37% DER on Libri2Mix is attributable to unification or to other optimization factors.

minor comments (2)

[§3.2] The RWSE formulation in §3.2 would benefit from an explicit equation showing how layer weights are learned and applied, to aid reproducibility.
[Figure 1] The architecture diagram could more clearly label the residual connections and task-specific output heads.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that stronger controls are needed to substantiate the source of the reported gains and will revise the experiments section accordingly.


Referee: [§4] (Experiments) The central claim that joint multi-task training with RWSE produces beneficial synergies rests on comparisons to dedicated single-task baselines, yet the manuscript provides no ablations that control for model capacity, training schedule, or loss-weighting effects, nor any diagnostics for gradient conflicts or negative transfer. This leaves open whether the reported 1.37% DER on Libri2Mix is attributable to unification or to other optimization factors.

Authors: We acknowledge the validity of this concern. The single-task baselines in the current manuscript use the same encoder backbone but were not explicitly matched on every hyperparameter. In the revision we will add capacity-matched ablations (identical parameter count and layer configuration for single-task models), loss-weight sweeps, and training-schedule controls. We will also include gradient-norm diagnostics across tasks to assess potential conflicts or negative transfer. These additions will allow readers to better attribute the 1.37% DER on Libri2Mix to the joint training and RWSE mechanism.

Revision: yes
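
One way the gradient diagnostics discussed in this exchange could be computed is sketched below. It reports per-task gradient norms on the shared encoder and, as an added assumption beyond the norms mentioned in the rebuttal, the pairwise cosine similarity between task gradients, where a negative value suggests two tasks are pulling the shared weights in opposing directions. The function name `gradient_diagnostics` and the flattened-list input format are hypothetical.

```python
import math


def gradient_diagnostics(task_grads):
    """Per-task gradient norms and pairwise cosine similarities.

    task_grads: dict mapping a task name to its gradient with respect to the
                shared encoder parameters, flattened into a list of floats.
    Returns (norms, cosines), where cosines is keyed by (task_a, task_b).
    """
    names = list(task_grads)
    norms = {n: math.sqrt(sum(g * g for g in task_grads[n])) for n in names}
    cosines = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dot = sum(x * y for x, y in zip(task_grads[a], task_grads[b]))
            denom = norms[a] * norms[b]
            # Negative cosine: the two tasks push shared weights in opposing directions.
            cosines[(a, b)] = dot / denom if denom > 0 else 0.0
    return norms, cosines
```

Logged over training steps, a persistently negative cosine between, say, the separation and ASR gradients would be a sign of the negative transfer the referee asks about, while a large imbalance in norms would suggest one task dominating the update.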

Circularity Check

0 steps flagged

Empirical multi-task unification with no derivation circularity


The paper introduces UME as a shared-encoder architecture for joint SD/SS/ASR training with RWSE, then reports held-out LibriMix metrics (e.g., 1.37% DER on Libri2Mix) that exceed single-task baselines. No equations, loss terms, or self-citations reduce these gains to quantities defined by the paper's own fitted parameters or internal re-derivations. The central claim rests on external benchmark comparisons rather than any self-referential construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of joint training and the RWSE mechanism; no explicit mathematical axioms are stated, but the approach implicitly assumes that a shared foundational encoder can be adapted without catastrophic forgetting across tasks and that LibriMix mixtures are representative of real overlapping speech.

free parameters (2)

layer weights in RWSE
The residual weighted-sum encoding requires learned or tuned weights for combining hidden representations from multiple encoder layers; these are fitted during joint training.
task-specific loss weights
Balancing the diarization, separation, and ASR losses during multi-task optimization introduces additional scalar hyperparameters that are chosen or tuned.
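
To make the second free parameter concrete, the sketch below combines per-task losses by a weighted sum and enumerates candidate weight settings for a sweep. The weighted-sum form, the function names `joint_loss` and `weight_grid`, and the task keys are assumptions for illustration; the page does not state how the paper balances the three losses.

```python
from itertools import product


def joint_loss(losses, weights):
    """Weighted sum of per-task losses; both arguments are dicts keyed by task."""
    return sum(weights[task] * value for task, value in losses.items())


def weight_grid(tasks, values):
    """Yield every assignment of candidate weights to tasks for a sweep."""
    for combo in product(values, repeat=len(tasks)):
        yield dict(zip(tasks, combo))
```

A sweep over `weight_grid(["sd", "ss", "asr"], [0.1, 1.0])` would produce eight configurations, which gives a sense of how quickly the search grows as more candidate weights are added.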

axioms (1)

domain assumption: A pre-trained speech foundational encoder provides useful hierarchical representations that can be shared across SD, SS, and ASR without major negative transfer.
Invoked when the paper states that the shared encoder jointly learns representations for the three tasks.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "joint training approach captures the inherent inter-dependencies among the tasks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

A review of speaker diarization: Recent advances with deep learning,

T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan, “A review of speaker diarization: Recent advances with deep learning,” Computer speech & language, vol. 72, p. 101317, 2022

work page 2022

[2] [2]

Encoder-decoder based attractors for end-to-end neural diarization,

S. Horiguchi et al. , “Encoder-decoder based attractors for end-to-end neural diarization,” IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 30, pp. 1493–1507, 2022

work page 2022

[3] [3]

Powerset multi-class cross entropy loss for neural speaker diarization,

A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. Interspeech, 2023, pp. 3222–3226

work page 2023

[4] [4]

Supervised speech separation based on deep learning: An overview,

D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Lang. Pro- cess., vol. 26, no. 10, pp. 1702–1726, 2018

work page 2018

[5] [5]

Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,

Y . Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 27, no. 8, pp. 1256–1266, 2019

work page 2019

[6] [6]

TF-GRIDNET: Making time- frequency domain models great again for monaural speaker separation,

Z.-Q. Wang, S. Cornell, S. Choi et al. , “TF-GRIDNET: Making time- frequency domain models great again for monaural speaker separation,” in Proc. ICASSP, 2023, pp. 1–5

work page 2023

[7] [7]

Single-channel multi-talker speech recognition with permutation invariant training,

Y . Qian, X. Chang, and D. Yu, “Single-channel multi-talker speech recognition with permutation invariant training,” Speech Communica- tion, vol. 104, pp. 1–11, 2018

work page 2018

[8] [8]

A purely end-to-end system for multi-speaker speech recognition,

H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey, “A purely end-to-end system for multi-speaker speech recognition,” in Proc. ACL, Melbourne, Australia, Jul. 2018, pp. 2620–2630

work page 2018

[9] [9]

End-to-end multi-speaker speech recognition with transformer,

X. Chang, W. Zhang, Y . Qian, J. L. Roux, and S. Watanabe, “End-to-end multi-speaker speech recognition with transformer,” in Proc. ICASSP , 2020, pp. 6134–6138

work page 2020

[10] [10]

Serialized output training for end-to-end overlapped speech recognition,

N. Kanda, Y . Gaur, X. Wang, Z. Meng, and T. Yoshioka, “Serialized output training for end-to-end overlapped speech recognition,” in Proc. Interspeech, 2020, pp. 2797–2801

work page 2020

[11] [11]

Integration of speech separation, diarization, and recog- nition for multi-speaker meetings: System description, comparison, and analysis,

D. Raj et al., “Integration of speech separation, diarization, and recog- nition for multi-speaker meetings: System description, comparison, and analysis,” in Proc. SLT, 2021, pp. 897–904

work page 2021

[12] [12]

Continuous speech separation: Dataset and analysis,

Z. Chen, T. Yoshioka, L. Lu et al. , “Continuous speech separation: Dataset and analysis,” in Proc. ICASSP, 2020, pp. 7284–7288

work page 2020

[13] [13]

CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings,

S. Watanabe, M. Mandel, J. Barker, E. Vincent et al. , “CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. CHiME, 2020, pp. 1–7

work page 2020

[14] [14]

Tandem multitask training of speaker diarisation and speech recognition for meeting transcription,

X. Zheng, C. Zhang, and P. Woodland, “Tandem multitask training of speaker diarisation and speech recognition for meeting transcription,” in Proc. Interspeech, 2022, pp. 3844–3848

work page 2022

[15] [15]

pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,

H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in Proc. Interspeech, 2023, pp. 1983–1987

work page 2023

[16] [16]

TS-SEP: Joint di- arization and separation conditioned on estimated speaker embeddings,

C. Boeddeker, A. S. Subramanian, G. Wichern et al., “TS-SEP: Joint di- arization and separation conditioned on estimated speaker embeddings,” in IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 32, 2024, pp. 1185–1197

work page 2024

[17] [17]

PixIT: Joint training of speaker diarization and speech separation from real-world multi-speaker recordings,

J. Kalda et al., “PixIT: Joint training of speaker diarization and speech separation from real-world multi-speaker recordings,” in Proc. Odyssey, 2024, pp. 115–122

work page 2024

[18] [18]

Adapting multi-lingual asr models for handling multiple talkers,

C. Li, Y . Qian, Z. Chen, N. Kanda, D. Wang, T. Yoshioka, Y . Qian, and M. Zeng, “Adapting multi-lingual asr models for handling multiple talkers,” in Proc. Interspeech, 2023, pp. 1314–1318

work page 2023

[19] [19]

Speech recog- nition and multi-speaker diarization of long conversations,

H. H. Mao, S. Li, J. McAuley, and G. W. Cottrell, “Speech recog- nition and multi-speaker diarization of long conversations,” in Proc. Interspeech, 2020, pp. 691–695

work page 2020

[20] [20]

One model to rule them all? Towards end-to-end joint speaker diarization and speech recognition,

S. Cornell, J.-W. Jung, S. Watanabe, and S. Squartini, “One model to rule them all? Towards end-to-end joint speaker diarization and speech recognition,” in Proc. ICASSP, 2024, pp. 11 856–11 860

work page 2024

[21] [21]

Streaming speaker-attributed ASR with token-level speaker embeddings,

N. Kanda et al., “Streaming speaker-attributed ASR with token-level speaker embeddings,” in Proc. Interspeech, 2022, pp. 521–525

work page 2022

[22] [22]

MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition,

X. Chang et al., “MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition,” in Proc. ASRU, 2019, pp. 237–244

work page 2019

[23] [23]

Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR,

T. von Neumann et al., “Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR,” in Proc. Interspeech, 2020, pp. 3097–3101

work page 2020

[24] [24]

All-neural online source separation, counting, and diarization for meeting analysis,

——, “All-neural online source separation, counting, and diarization for meeting analysis,” in Proc. ICASSP, 2019, pp. 91–95

work page 2019

[25] [25]

Neural blind source separation and diarization for distant speech recognition,

Y. Bando, T. Nakamura, and S. Watanabe, “Neural blind source separation and diarization for distant speech recognition,” in Proc. Interspeech, 2024, pp. 722–726

work page 2024

[26] [26]

STCON system for the CHiME-8 challenge,

A. Mitrofanov, T. Prisyach, T. Timofeeva et al., “STCON system for the CHiME-8 challenge,” in Proc. CHiME, 2024, pp. 13–17

work page 2024

[27] [27]

BUT/JHU system description for CHiME-8 NOTSOFAR-1 challenge,

A. Polok, D. Klement, J. Han, Šimon Sedláček, B. Yusuf, M. Maciejewski, M. S. Wiesner, and L. Burget, “BUT/JHU system description for CHiME-8 NOTSOFAR-1 challenge,” in Proc. CHiME, 2024, pp. 18–22

work page 2024

[28] [28]

The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,

S. Niu, R. Wang, J. Du et al., “The USTC-NERCSLIP systems for the CHiME-8 NOTSOFAR-1 challenge,” in Proc. CHiME, 2024, pp. 31–36

work page 2024

[29] [29]

NTT multi-speaker ASR system for the DASR task of CHiME-8 challenge,

N. Kamo, N. Tawara, A. Ando et al., “NTT multi-speaker ASR system for the DASR task of CHiME-8 challenge,” in Proc. CHiME, 2024, pp. 69–74

work page 2024

[30] [30]

wav2vec 2.0: a framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” in Proc. NeurIPS, ser. NIPS ’20, 2020

work page 2020

[31] [31]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” in IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, Oct. 2021, pp. 3451–3460

work page 2021

[32] [32]

WavLM: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” in IEEE J. Sel. Topics Signal Process., vol. 16, no. 6, 2022, pp. 1505–1518

work page 2022

[33] [33]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu et al., “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, vol. 202, 23–29 Jul 2023, pp. 28 492–28 518

work page 2023

[34] [34]

OWSM-CTC: An open encoder-only speech foundation model for speech recognition, translation, and language identification,

Y. Peng, Y. Sudo, M. Shakeel, and S. Watanabe, “OWSM-CTC: An open encoder-only speech foundation model for speech recognition, translation, and language identification,” in Proc. ACL, Aug. 2024, pp. 10 192–10 209

work page 2024

[35] [35]

SUPERB: Speech processing universal performance benchmark,

S. Yang et al., “SUPERB: Speech processing universal performance benchmark,” in Proc. Interspeech, 2021, pp. 1194–1198

work page 2021

[36] [36]

OWSM v3.1: Better and faster open Whisper-style speech models based on E-Branchformer,

Y. Peng, J. Tian, W. Chen et al., “OWSM v3.1: Better and faster open Whisper-style speech models based on E-Branchformer,” in Proc. Interspeech, 2024, pp. 352–356

work page 2024

[37] [37]

LibriMix: An open-source dataset for generalizable speech separation,

J. Cosentino, M. Pariente, S. Cornell et al., “LibriMix: An open-source dataset for generalizable speech separation,” 2020

work page 2020

[38] [38]

E-Branchformer: Branchformer with enhanced merging for speech recognition,

K. Kim, F. Wu, Y. Peng et al., “E-Branchformer: Branchformer with enhanced merging for speech recognition,” in Proc. SLT, 2023, pp. 84–91

work page 2023

[39] [39]

End-to-end training of time domain audio separation and recognition,

T. von Neumann et al., “End-to-end training of time domain audio separation and recognition,” in Proc. ICASSP, 2020, pp. 7004–7008

work page 2020

[40] [40]

The AMI meeting corpus: A pre-announcement,

J. Carletta, S. Ashby, S. Bourban et al., “The AMI meeting corpus: A pre-announcement,” in Machine Learning for Multimodal Interaction, 2006, pp. 28–39

work page 2006

[41] [41]

The Hitachi-JHU DIHARD III System: Competitive end-to-end neural diarization and x-vector clustering systems combined by DOVER-Lap,

S. Horiguchi, N. Yalta, P. Garcia et al., “The Hitachi-JHU DIHARD III System: Competitive end-to-end neural diarization and x-vector clustering systems combined by DOVER-Lap,” 2021

work page 2021

[42] [42]

The rich transcription 2006 spring meeting recognition evaluation,

J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo, “The rich transcription 2006 spring meeting recognition evaluation,” in Machine Learning for Multimodal Interaction, 2006, pp. 309–322

work page 2006

[43] [43]

A short-time objective intelligibility measure for time-frequency weighted noisy speech,

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. ICASSP, 2010, pp. 4214–4217

work page 2010

[44] [44]

Performance measurement in blind audio source separation,

E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006

work page 2006

[45] [45]

Streaming end-to-end multi-talker speech recognition,

L. Lu, N. Kanda, J. Li, and Y. Gong, “Streaming end-to-end multi-talker speech recognition,” IEEE Signal Process. Lett., vol. 28, pp. 803–807, 2021

work page 2021

[46] [46]

End-to-end Speaker-Attributed ASR with transformer,

N. Kanda, G. Ye, Y. Gaur, X. Wang, Z. Meng, Z. Chen, and T. Yoshioka, “End-to-end Speaker-Attributed ASR with transformer,” in Proc. Interspeech, 2021, pp. 4413–4417

work page 2021

[47] [47]

Empowering Whisper as a joint multi-talker and target-talker speech recognition system,

L. Meng, J. Kang, Y. Wang et al., “Empowering Whisper as a joint multi-talker and target-talker speech recognition system,” in Proc. Interspeech, 2024, pp. 4653–4657

work page 2024

[48] [48]

ESPnet: End-to-end speech processing toolkit,

S. Watanabe, T. Hori, S. Karita et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211

work page 2018

[49] [49]

The power of the weighted sum scalarization for approximating multiobjective optimization problems,

C. Bazgan et al., “The power of the weighted sum scalarization for approximating multiobjective optimization problems,” Theory of Computing Systems, vol. 66, no. 1, pp. 395–415, Feb 2022

work page 2022

[50] [50]

Joint beam search integrating CTC, attention, and transducer decoders,

Y. Sudo, M. Shakeel, Y. Fukumoto, B. Yan, J. Shi, Y. Peng, and S. Watanabe, “Joint beam search integrating CTC, attention, and transducer decoders,” IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 598–612, 2025

work page 2025