
arxiv: 2606.11795 · v1 · pith:WEL3PNLJnew · submitted 2026-06-10 · 📡 eess.AS · cs.SD

Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency

Pith reviewed 2026-06-27 08:33 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords speaker diarization · boundary prediction · causal models · anticausal models · pseudo labels · co-training · loose annotations

The pith

Causal and anticausal models generate tighter speech boundaries from loose labels by avoiding learned looseness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to enable speaker diarization models to output tight speech intervals even when trained only on loose annotations that include pauses and margins. It does so by training separate causal and anticausal models on the same loose data and using their outputs as pseudo labels that naturally exclude the loosening patterns. A co-training loop then refines both models and the labels together. A sympathetic reader would care because many downstream tasks prefer precise segment boundaries and because the method avoids the need for new tightly annotated training sets.

Core claim

The central discovery is that causal and anticausal models are unable to reproduce the loosening behavior present in loose labels, so their predictions can serve as tighter pseudo labels. An iterative co-training procedure that alternates between label tightening and model updates then recovers roughly 70 percent of the boundary-tightening benefit of ideal tight-label training and yields measurable gains on downstream automatic speech recognition.

What carries the argument

Causal-anticausal consistency: the agreement between a forward (causal) model and a backward (anticausal) model supplies the tightened pseudo labels that drive the co-training loop.
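A minimal sketch of how such agreement could tighten a loose label, assuming frame-wise posteriors from the two directional models. The intersection rule, the function name `tighten_by_agreement`, the threshold, and the toy posteriors are all illustrative assumptions; the tightening variants the paper actually evaluates are not spelled out on this page.

```python
import numpy as np

def tighten_by_agreement(loose, p_causal, p_anticausal, threshold=0.5):
    """Keep a frame active only where the loose label and BOTH directional
    models agree that speech is present (assumed rule, not the paper's)."""
    loose = np.asarray(loose, dtype=bool)
    causal_active = np.asarray(p_causal) >= threshold
    anticausal_active = np.asarray(p_anticausal) >= threshold
    return loose & causal_active & anticausal_active

# Toy single-speaker case: actual speech on frames 3-6, loose label on frames 1-8.
loose = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0])
# Hypothetical posteriors. The causal model cannot pad the onset (it has not
# heard the speech yet) and loses confidence after the offset.
p_causal = np.array([0.0, 0.1, 0.2, 0.9, 0.9, 0.9, 0.9, 0.6, 0.3, 0.1])
# The anticausal model mirrors this: no padding after the offset,
# fading confidence before the onset.
p_anticausal = np.array([0.1, 0.3, 0.6, 0.9, 0.9, 0.9, 0.9, 0.2, 0.1, 0.0])

tight = tighten_by_agreement(loose, p_causal, p_anticausal)
print(tight.astype(int))  # frames 3-6 remain active
```

Each model keeps the margin on only one side of the segment, so the intersection removes both margins in this toy case.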

If this is right

  • The co-training recovers about 70 percent of the tightening effect obtained by training on ideal tight labels.
  • Downstream automatic speech recognition performance improves when the diarization output uses the tightened boundaries.
  • Iterative refinement produces progressively tighter labels without external supervision.
  • The same loose-labeled data can support both standard and tight-boundary diarization models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency principle could be tested on other sequence tasks that suffer from margin-inflated annotations, such as action segmentation in video.
  • If the tightening effect scales with model size, very large causal-anticausal pairs might approach the performance of models trained on expensive tight labels.
  • The method supplies a concrete way to measure how much looseness a given architecture has internalized, which could guide architecture choices for boundary-sensitive applications.

Load-bearing premise

Causal and anticausal models are inherently incapable of learning loosening behavior from the loose labels.

What would settle it

Training a causal or anticausal model on the same loose labels and observing that its boundary predictions remain as loose as those of a standard bidirectional model would falsify the central premise.
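One way to run this check is to score each model's output against a tight reference and compare how much extra speech it predicts. The sketch below uses a simple excess-speech ratio; the metric, the name `excess_speech_ratio`, and the toy frame sequences are assumptions for illustration (the paper itself reports DER against ideal tight labels).

```python
import numpy as np

def excess_speech_ratio(pred, tight_ref):
    """Frames predicted active outside the tight reference, divided by the
    number of reference speech frames (assumed looseness measure)."""
    pred = np.asarray(pred, dtype=bool)
    tight_ref = np.asarray(tight_ref, dtype=bool)
    ref_frames = tight_ref.sum()
    if ref_frames == 0:
        raise ValueError("tight reference contains no speech frames")
    return float((pred & ~tight_ref).sum() / ref_frames)

# Hypothetical frame-level outputs for one speaker.
tight_ref     = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
bidirectional = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # reproduces both margins
causal        = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]  # short tail after the offset only

print(excess_speech_ratio(bidirectional, tight_ref))
print(excess_speech_ratio(causal, tight_ref))
```

If the ratio for the causal or anticausal model were consistently no lower than for the bidirectional model on real data, that would count against the premise.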

Figures

Figures reproduced from arXiv: 2606.11795 by Atsushi Ando, Marc Delcroix, Naohiro Tawara, Shota Horiguchi, Takanori Ashihara.

Figure 1
Figure 1: Comparison of model behavior in conventional training and in this study. At present, the only way to train a model that can predict segments with tight boundaries is to provide tight labels as supervision (…
Figure 2
Figure 2: Behavior of causal and anticausal models in single-speaker cases.
…ing the previous studies [10], we only consider the overlap of at most two speakers, and thus restrict $R_r$ such that $|R_r| \le 2$, which results in $R = \sum_{i=0}^{2} \binom{S}{i}$. In this paper, we set $S = 4$, so the set of possible classes is given by $R_r \in \{\emptyset, \{1\}, \{2\}, \{3\}, \{4\}, \{1, 2\}, \{1, 3\}, \{1, 4\}, \{2, 3\}, \{2, 4\}, \{3, 4\}\}$. We define a mapping Q2P : Q ↦ …
Figure 3
Figure 3: Example of label tightening in a two-speaker case.
…ward. As a result, it cannot determine whether a subsequent silent interval should be treated as part of a speech segment or as the end of speech. Consequently, the model becomes progressively less confident after speech offset, resulting in lower posterior probabilities. The anticausal model exhibits analogous behavior. It cannot pad the end of a speech…
Figure 4
Figure 4: DER (%) of causal and anticausal models after co-training compared with non-causal models (P1–P3).
…ing predictions from being affected by unpredictable loosening behavior. Among the three tightening methods, the SC-based approach achieved DER reductions comparable to or greater than those of the other methods. Detailed analyses of each tightening method are provided in Sec. 6.1.2. We also report the perfo…
Figure 7
Figure 7: Qualities of tightened training-set labels obtained using ReDimNet-based models measured by DER against ideal tight labels. For each proposed method, the three bars (from top to bottom) represent vanilla, with restoration, and with restoration and co-training.
…result was consistent with the discussion in Sec. 4.3.2 that VAD tightening behaves conservatively. The impact of this over-tightening on downstre…
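The class set quoted in the Figure 2 excerpt (four speakers, at most two overlapping) can be enumerated directly. The sketch below is only an illustration of that counting; the function name `powerset_classes` and the class ordering are assumptions, and nothing here is taken from the authors' code.

```python
from itertools import combinations
from math import comb

def powerset_classes(num_speakers, max_overlap):
    """All speaker subsets of size 0..max_overlap, smallest subsets first."""
    speakers = range(1, num_speakers + 1)
    return [
        frozenset(subset)
        for size in range(max_overlap + 1)
        for subset in combinations(speakers, size)
    ]

classes = powerset_classes(num_speakers=4, max_overlap=2)

# Number of classes: C(4,0) + C(4,1) + C(4,2) = 1 + 4 + 6
expected = sum(comb(4, i) for i in range(3))
print(len(classes), expected)
for index, active in enumerate(classes):
    print(index, sorted(active))
```

With S = 4 this yields 11 classes: silence, four single-speaker classes, and six two-speaker overlaps, matching the set listed in the excerpt.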
original abstract

Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70 % of the tightening effect achieved by ideal tight-label training and improves downstream performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper addresses the task of producing tight speech boundary predictions for speaker diarization when models are trained on loose labels (which include pauses and margins for semantic continuity). It proposes generating tighter pseudo-labels via causal and anticausal models, which are asserted to be inherently incapable of learning loosening behavior, combined with an iterative co-training scheme that refines both models and labels. The abstract reports that the method recovers approximately 70% of the tightening effect obtained from ideal tight-label training and yields downstream performance gains.

Significance. If the directional-model assumption is substantiated and the reported recovery rate holds under controlled experiments with proper baselines, the approach could offer a label-efficient route to improved boundary precision in diarization without requiring re-annotation, directly benefiting downstream tasks such as multi-talker ASR. The consistency-based tightening idea is novel in this domain.

major comments (2)
  1. [Abstract] Abstract: the claim that causal and anticausal models are 'inherently incapable of learning loosening behavior' from loose labels is presented without derivation, ablation, or even a qualitative counter-example showing why unidirectional processing prevents reproduction of extra boundary margins. This assumption is load-bearing for the entire pseudo-label generation step; if the models can still learn loosening via acoustic cues or label statistics, the claimed tightening advantage collapses.
  2. [Abstract] Abstract: the 70% recovery figure and downstream gains are stated without any description of experimental setup, baselines (e.g., standard bidirectional model, oracle tight labels), metrics (DER, boundary F1, etc.), datasets, or error analysis, rendering the quantitative claim unverifiable from the given information.
minor comments (1)
  1. The co-training procedure and the precise mechanism for enforcing causal-anticausal consistency should be formalized with equations or pseudocode in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point-by-point below, with proposed revisions to the abstract where they improve clarity without altering the core claims.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that causal and anticausal models are 'inherently incapable of learning loosening behavior' from loose labels is presented without derivation, ablation, or even a qualitative counter-example showing why unidirectional processing prevents reproduction of extra boundary margins. This assumption is load-bearing for the entire pseudo-label generation step; if the models can still learn loosening via acoustic cues or label statistics, the claimed tightening advantage collapses.

    Authors: The abstract is concise by design, but the full manuscript (Sections 2–3) derives the claim from the unidirectional architectures: a causal model has access only to past frames and cannot anticipate or reproduce future pauses/margins that loose labels encode for semantic continuity; an anticausal model lacks past context for the same reason. We include qualitative examples in the paper showing that even when trained on loose labels these models produce tighter boundaries than bidirectional counterparts. We will revise the abstract to incorporate a brief qualitative counter-example and a pointer to the architectural rationale. revision: partial

  2. Referee: [Abstract] Abstract: the 70% recovery figure and downstream gains are stated without any description of experimental setup, baselines (e.g., standard bidirectional model, oracle tight labels), metrics (DER, boundary F1, etc.), datasets, or error analysis, rendering the quantitative claim unverifiable from the given information.

    Authors: We agree the abstract omits experimental details due to length constraints. The full manuscript (Sections 4–5) specifies the datasets, metrics (DER and boundary F1), baselines (bidirectional loose-label model and oracle tight-label training), and the exact computation of the 70% recovery (fraction of the boundary-tightness gap closed relative to the loose-label baseline). Downstream gains are measured on multi-talker ASR. We will revise the abstract to add one sentence summarizing the evaluation protocol and recovery definition. revision: yes
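The recovery definition given in the second response (fraction of the gap between the loose-label baseline and tight-label training that the method closes) can be written as a one-line computation. This is a sketch under that stated definition; the function name and the scores in the example are hypothetical and are not results from the paper.

```python
def recovery_fraction(loose_score, method_score, tight_score):
    """Fraction of the loose-to-tight gap closed by the method.

    Scores are error rates (lower is better), e.g. DER measured against
    tight reference labels.
    """
    gap = loose_score - tight_score
    if gap == 0:
        raise ValueError("loose and tight baselines are identical; gap is zero")
    return (loose_score - method_score) / gap

# Hypothetical error rates chosen only to illustrate the arithmetic:
# loose-label baseline 20.0, tight-label topline 10.0, proposed method 13.0.
recovered = recovery_fraction(loose_score=20.0, method_score=13.0, tight_score=10.0)
print(f"{recovered:.0%} of the gap recovered")
```

A value of 1.0 would mean the method matches tight-label training, and 0.0 would mean no improvement over the loose-label baseline.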

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central method rests on the stated property that causal and anticausal models are inherently incapable of learning loosening behavior from loose labels, used to generate tighter pseudo-labels via co-training. No equations, fitting procedures, or self-citations are present in the provided text that reduce this claim or the reported 70% recovery to a self-referential quantity by construction. The derivation does not match any enumerated circularity pattern and remains self-contained, relying on directional model properties rather than inputs renamed as outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that causal and anticausal models cannot reproduce loosening behavior; no free parameters, new entities, or additional axioms are mentioned in the abstract.

axioms (1)
  • domain assumption Causal and anticausal models are inherently incapable of learning loosening behavior
    Invoked to justify generation of tighter pseudo labels from loose training data.

pith-pipeline@v0.9.1-grok · 5681 in / 1114 out tokens · 20958 ms · 2026-06-27T08:33:16.080770+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 3 linked inside Pith

  1. [1]

    takes around twice the video duration

    Introduction. Speaker diarization is the task of determining who is speaking when from an audio signal. It is typically addressed using models that take audio as input and output speaker-wise speech segments [1–8]. Training such models requires a large amount of data annotated with speaker-wise speech intervals, and in practice, multi-talker automati...

  2. [2]

    Speaker diarization Training end-to-end neural diarization (EEND) models requires a large-scale dataset

    Related work. 2.1. Speaker diarization. Training end-to-end neural diarization (EEND) models requires a large-scale dataset. In early studies, since it was difficult to achieve large-scale training using only real data, simulated mixtures were used for pretraining [1, 4–6]. In recent years, an increasing number of corpora have been developed for multi-ta...

  3. [3]

    Formulation of the conventional speaker diarization Speaker diarization is the problem of estimating speaker-wise speech activity at each frame

    Problem formulation. 3.1. Formulation of the conventional speaker diarization. Speaker diarization is the problem of estimating speaker-wise speech activity at each frame. Let $X = [x_1, \dots, x_T] \in \mathbb{R}^{D \times T}$ denote frame-wise $D$-dimensional acoustic features, where $T$ is the number of frames. Given $X$, the goal is to estimate the speaker activities $\hat{Y} = [\hat{y}_1, \dots, \hat{y}_T$...

  4. [4]

    Concept As illustrated in Fig

    Proposed method. 4.1. Concept. As illustrated in Fig. 1, conventional training leads speaker diarization models to estimate not only who is speaking when but also to pad segment boundaries and fill pauses. If the former functionality can be isolated, it would benefit downstream tasks. However, once these functions are internalized in a diarization model, ...

  5. [5]

    To determine how much to pad segment boundaries or how long to fill a silent interval, a model must first identify tightly bounded speech segments and then examine the surrounding context. Since diarization models are typically based on bidirectional [1], fully attentive [2,5], or large-receptive-field architectures [33,43], such capabilities tend to ...

  6. [6]

    Experimental settings. 5.1. Speaker diarization pipeline. We adopted the EEND-vector clustering framework [4], which consists of i) local diarization with a 10-second window and a 1-second shift, ii) speaker embedding extraction for each detected speaker in each window, and iii) clustering of the speaker embeddings to determine speaker correspondence acr...

  7. [7]

    Speaker diarization 6.1.1

    Results. 6.1. Speaker diarization. 6.1.1. Main results. We first present the overall results on the in-domain corpora in Table 2. For the baseline models (B1), we use the annotated labels shown in Table 1 as supervision, whereas the topline models (B2) use the ideally tightened labels via forced alignment for the ASR corpora, while the diarization corpor...

  8. [8]

    By leveraging the asymmetric padding properties of causal and anticausal models, the method refines loose annotations through co-training

    Conclusions. In this paper, we proposed a training method for speaker diarization models with loose ASR labels while enabling tight boundary inference. By leveraging the asymmetric padding properties of causal and anticausal models, the method refines loose annotations through co-training. This eliminates the need for costly manual annotation or acce...

  9. [9]

    All technical content was developed and verified by the authors

    Generative AI Use Disclosure The authors used a large language model (ChatGPT) only to assist with language editing and polishing. All technical content was developed and verified by the authors

  10. [10]

    End-to-end neural speaker diarization with permutation-free objectives,

    Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in Proc. Interspeech, 2019, pp. 4300–4304

  11. [11]

    End-to-end neural speaker diarization with self-attention,

    Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in Proc. ASRU, 2019, pp. 296–303

  12. [12]

    Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario,

    I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, “Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario,” in Proc. Interspeech, 2020, pp. 274–278

  13. [13]

    Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,

    K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in Proc. ICASSP, 2021, pp. 7198–7202

  14. [14]

    Encoder-decoder based attractors for end-to-end neural diarization,

    S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and P. García, “Encoder-decoder based attractors for end-to-end neural diarization,” IEEE/ACM TASLP, vol. 30, pp. 1493–1507, 2022

  15. [15]

    Target speaker voice activity detection with transformers and its integration with end-to-end neural diarization,

    D. Wang, X. Xiao, N. Kanda, T. Yoshioka, and J. Wu, “Target speaker voice activity detection with transformers and its integration with end-to-end neural diarization,” in Proc. ICASSP, 2023

  16. [16]

    EEND-M2F: Masked-attention mask transformers for speaker diarization,

    M. Härkönen, S. J. Broughton, and L. Samarakoon, “EEND-M2F: Masked-attention mask transformers for speaker diarization,” in Proc. Interspeech, 2024, pp. 37–41

  17. [17]

    Sequence-to-sequence neural diarization with automatic speaker detection and representation,

    M. Cheng, Y. Lin, and M. Li, “Sequence-to-sequence neural diarization with automatic speaker detection and representation,” IEEE TASLPRO, vol. 33, pp. 2719–2734, 2025

  18. [18]

    pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,

    H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in Proc. Interspeech, 2023, pp. 1983–1987

  19. [19]

    Powerset multi-class cross entropy loss for neural speaker diarization,

    A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” in Proc. Interspeech, 2023, pp. 3222–3226

  20. [20]

    Leveraging self-supervised learning for speaker diarization,

    J. Han, F. Landini, J. Rohdin, A. Silnova, M. Diez, and L. Burget, “Leveraging self-supervised learning for speaker diarization,” in Proc. ICASSP, 2025

  21. [21]

    Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization,

    J. Han, F. Landini, J. Rohdin, A. Silnova, M. Diez, J. Cernocky, and L. Burget, “Fine-tune before structured pruning: Towards compact and accurate self-supervised models for speaker diarization,” in Proc. Interspeech, 2025, pp. 1583–1587

  22. [22]

    Can we really repurpose multi-speaker ASR corpus for speaker diarization?

    S. Horiguchi, N. Tawara, T. Ashihara, A. Ando, and M. Delcroix, “Can we really repurpose multi-speaker ASR corpus for speaker diarization?” in Proc. ASRU, 2025

  23. [23]

    AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario,

    Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, L. Xie, J. Wu, H. Bu, X. Xu, J. Du, and J. Chen, “AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario,” in Proc. Interspeech, 2021, pp. 3665–3669

  24. [24]

    Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus,

    J. Carletta, “Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus,” Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007

  25. [25]

    M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,

    F. Yu, S. Zhang, Y. Fu, L. Xie, S. Zheng, Z. Du, W. Huang, P. Guo, Z. Yan, B. Ma, X. Xu, and H. Bu, “M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in Proc. ICASSP, 2022, pp. 6167–6171

  26. [26]

    The fifth ‘CHiME’ Speech Separation and Recognition Challenge: dataset, task and baselines,

    J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ Speech Separation and Recognition Challenge: dataset, task and baselines,” in Proc. Interspeech, 2018, pp. 1561–1565

  27. [27]

    DiPCo—dinner party corpus,

    M. Van Segbroeck, A. Zaid, K. Kutsenko, C. Huerta, T. Nguyen, X. Luo, B. Hoffmeister, J. Trmal, M. Omologo, and R. Maas, “DiPCo—dinner party corpus,” in Proc. Interspeech, 2020, pp. 434–436

  28. [28]

    Front-end processing for the CHiME-5 dinner party scenario,

    C. Boeddeker, J. Heitkaemper, J. Schmalenstoeer, L. Drude, J. Heymann, and R. Haeb-Umbach, “Front-end processing for the CHiME-5 dinner party scenario,” in Proc. CHiME-5, 2018, pp. 35–40

  29. [29]

    GPU-accelerated guided source separation for meeting transcription,

    D. Raj, D. Povey, and S. Khudanpur, “GPU-accelerated guided source separation for meeting transcription,” in Proc. Interspeech, 2023, pp. 3507–3511

  30. [30]

    Moshi: a speech-text foundation model for real-time dialogue,

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv:2410.00037, 2024

  31. [31]

    Investigating the effects of large-scale pseudo-stereo data and different speech foundation model on dialogue generative spoken language model,

    Y.-K. Fu, C.-K. Lee, H.-H. Wang, and H.-y. Lee, “Investigating the effects of large-scale pseudo-stereo data and different speech foundation model on dialogue generative spoken language model,” arXiv:2407.01911, 2024

  32. [32]

    Towards a Japanese full-duplex spoken dialogue system,

    A. Ohashi, S. Iizuka, J. Jiang, and R. Higashinaka, “Towards a Japanese full-duplex spoken dialogue system,” in Proc. Interspeech, 2025, pp. 1783–1787

  33. [33]

    Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,

    B. Veluri, B. Peloquin, B. Yu, H. Gong, and S. Gollakota, “Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,” in Proc. NAACL, 2024, pp. 21390–21402

  34. [34]

    TurnGuide: Enhancing meaningful full duplex spoken interactions via dynamic turn-level text-speech interleaving,

    W. Cui, L. Zhu, X. Li, and Z. Gui, “TurnGuide: Enhancing meaningful full duplex spoken interactions via dynamic turn-level text-speech interleaving,” arXiv:2508.07375, 2026

  35. [35]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023, pp. 28492–28518

  36. [36]

    Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,

    S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,” in Proc. NeurIPS, 2025

  37. [37]

    First DIHARD challenge evaluation plan,

    N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “First DIHARD challenge evaluation plan,” https://zenodo.org/record/1199638, 2018

  38. [38]

    The Second DIHARD Diarization Challenge: Dataset, task, and baselines,

    ——, “The Second DIHARD Diarization Challenge: Dataset, task, and baselines,” in Proc. Interspeech, 2019, pp. 978–982

  39. [39]

    The third DIHARD diarization challenge,

    N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “The third DIHARD diarization challenge,” in Proc. Interspeech, 2021, pp. 3570–3574

  40. [40]

    Third DIHARD challenge evaluation plan,

    N. Ryant, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “Third DIHARD challenge evaluation plan,” arXiv:2006.05815, 2020

  41. [41]

    Spot the conversation: Speaker diarisation in the wild,

    J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, “Spot the conversation: Speaker diarisation in the wild,” in Proc. Interspeech, 2020, pp. 299–303

  42. [42]

    Pretraining multi-speaker identification for neural speaker diarization,

    S. Horiguchi, A. Ando, M. Delcroix, and N. Tawara, “Pretraining multi-speaker identification for neural speaker diarization,” in Proc. Interspeech, 2025, pp. 1608–1612

  43. [43]

    Efficient and generalizable speaker diarization via structured pruning of self-supervised models,

    J. Han, P. Pálka, M. Delcroix, F. Landini, J. Rohdin, J. Černocký, and L. Burget, “Efficient and generalizable speaker diarization via structured pruning of self-supervised models,” arXiv:2506.18623, 2025

  44. [44]

    Learning with noisy labels,

    N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in Proc. NeurIPS, vol. 26, 2013, pp. 1196–1204

  45. [45]

    Making deep neural networks robust to label noise: A loss correction approach,

    G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Proc. CVPR, 2017, pp. 1944–1952

  46. [46]

    Does label smoothing mitigate label noise?

    M. Lukasik, S. Bhojanapalli, A. Menon, and S. Kumar, “Does label smoothing mitigate label noise?” in Proc. ICML. PMLR, 2020, pp. 6448–6458

  47. [47]

    Learning to reweight examples for robust deep learning,

    M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in Proc. ICML, 2018, pp. 4334–4343

  48. [48]

    MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,

    L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in Proc. ICML, 2018, pp. 2304–2313

  49. [49]

    CurriculumNet: Weakly supervised learning from large-scale web images,

    S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang, “CurriculumNet: Weakly supervised learning from large-scale web images,” in Proc. ECCV, 2018, pp. 135–150

  50. [50]

    Co-teaching: Robust training of deep neural networks with extremely noisy labels,

    B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in Proc. NeurIPS, vol. 31, 2018, pp. 8527–8537

  51. [51]

    How does disagreement help generalization against label corruption?

    X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama, “How does disagreement help generalization against label corruption?” in Proc. ICML, 2019, pp. 7164–7173

  52. [52]

    DIVE: End-to-end speech diarization via iterative speaker embeddings,

    N. Zeghidour, O. Teboul, and D. Grangier, “DIVE: End-to-end speech diarization via iterative speaker embeddings,” in Proc. ASRU, 2021, pp. 702–709

  53. [53]

    Collar-aware training for streaming speaker change detection in broadcast speech,

    J. Kalda and T. Alumäe, “Collar-aware training for streaming speaker change detection in broadcast speech,” in Proc. Odyssey 2022, 2022, pp. 141–147

  54. [54]

    Forward-backward convolutional recurrent neural networks and tag-conditioned convolutional neural networks for weakly labeled semi-supervised sound event detection,

    J. Ebbers and R. Haeb-Umbach, “Forward-backward convolutional recurrent neural networks and tag-conditioned convolutional neural networks for weakly labeled semi-supervised sound event detection,” in Proc. DCASE, 2020, pp. 41–45

  55. [55]

    VoxCeleb: A large-scale speaker identification dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A large-scale speaker identification dataset,” in Proc. Interspeech, 2017, pp. 2616–2620

  56. [56]

    Mamba-based segmentation model for speaker diarization,

    A. Plaquet, N. Tawara, M. Delcroix, S. Horiguchi, A. Ando, and S. Araki, “Mamba-based segmentation model for speaker diarization,” in Proc. ICASSP, 2025

  57. [57]

    WavLM: Large-scale self-supervised pre-training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  58. [58]

    ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Interspeech, 2020, pp. 3830–3834

  59. [59]

    Reshape dimensions network for speaker recognition,

    I. Yakovlev, R. Makarov, A. Balykin, P. Malov, A. Okhotnikov, and N. Torgashov, “Reshape dimensions network for speaker recognition,” in Proc. Interspeech, 2024, pp. 3235–3239

  60. [60]

    VoxCeleb: Large-scale speaker verification in the wild,

    A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, p. 101027, 2020

  61. [61]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. CVPR, 2018, pp. 7132–7141

  62. [62]

    Dual-mode ASR: Unify and improve streaming ASR with full-context modeling,

    J. Yu, W. Han, A. Gulati, C.-C. Chiu, B. Li, T. N. Sainath, Y. Wu, and R. Pang, “Dual-mode ASR: Unify and improve streaming ASR with full-context modeling,” in Proc. ICLR, 2021

  63. [63]

    MSDWild: Multi-modal speaker diarization dataset in the wild,

    T. Liu, S. Fan, X. Xiang, H. Song, S. Lin, J. Sun, T. Han, S. Chen, B. Yao, S. Liu, Y. Wu, Y. Qian, and K. Yu, “MSDWild: Multi-modal speaker diarization dataset in the wild,” in Proc. Interspeech, 2022, pp. 1476–1480

  64. [64]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015

  65. [65]

    SE-DiCoW: Self-enrolled diarization-conditioned Whisper,

    A. Polok, D. Klement, S. Cornell, M. Wiesner, J. Černocký, S. Khudanpur, and L. Burget, “SE-DiCoW: Self-enrolled diarization-conditioned Whisper,” in Proc. ICASSP, 2026

  66. [66]

    MeetEval: A toolkit for computation of word error rates for meeting transcription systems,

    T. von Neumann, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “MeetEval: A toolkit for computation of word error rates for meeting transcription systems,” in Proc. CHiME, 2023, pp. 27–32

  67. [67]

    Speech enhancement using self-supervised pre-trained model and vector quantization,

    X.-Y. Zhao, Q.-S. Zhu, and J. Zhang, “Speech enhancement using self-supervised pre-trained model and vector quantization,” in Proc. APSIPA ASC, 2022, pp. 330–334