Cross-Talk Speech Reduction, by Separation, for Separation
Pith reviewed 2026-05-20 01:50 UTC · model grok-4.3
The pith
Cross-talk reduction on real close-talk recordings produces pseudo-labels that train far-field separation models to new state-of-the-art ASR levels on CHiME-6.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a network called CTRnet, trained end-to-end on real-recorded close-talk and far-field mixture pairs, isolates each speaker's voice from cross-talk interference; the resulting estimates serve as effective pseudo-labels for a second stage, pseudo-label based far-field speech separation, that achieves state-of-the-art ASR word error rates on the CHiME-6 dataset under both oracle and estimated diarization while surpassing all prior CHiME-7 and CHiME-8 submissions.
What carries the argument
CTRnet, a neural separation model trained on real close-talk/far-field pairs to isolate the wearer's speech from cross-talk, whose outputs supply the pseudo-labels for the PuLSS far-field training stage.
If this is right
- Both CTRnet and the downstream far-field model can be trained entirely on real target-domain recordings without simulation.
- The framework delivers state-of-the-art ASR under oracle and estimated speaker diarization on CHiME-6.
- It is the first neural separation approach shown to substantially outperform guided source separation on real conversational data.
- Close-talk mixtures, previously too noisy for direct use, become usable weak supervision after cross-talk reduction.
Where Pith is reading between the lines
- The same real-pair training pattern could be applied to other microphone arrays where partial close-talk signals exist.
- Integrating CTRnet-style reduction inside an end-to-end diarization-plus-separation pipeline might further reduce error propagation.
- The pseudo-label strategy suggests a general route for adapting separation models to new acoustic environments using only the deployment hardware.
Load-bearing premise
The speech estimates produced by CTRnet on real close-talk mixtures must be clean enough to improve far-field model training rather than inject harmful label noise.
What would settle it
If ASR word error rate on the CHiME-6 evaluation set rises or stays flat when far-field models are retrained with CTRnet pseudo-labels instead of guided source separation labels, the central claim is false.
Figures
read the original abstract
In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk and far-field mixtures to accomplish CTR. Building on CTRnet, we further propose pseudo-label based far-field speech separation (PuLSS), which uses CTRnet's estimated clean speech as pseudo-labels to train models for separating far-field mixtures. A key advantage of the proposed framework is that both CTRnet and PuLSS can be trained on real-recorded data from the target domain, addressing the generalization gap commonly observed when models are trained exclusively on simulated data. On the CHiME-6 dataset, our framework achieves state-of-the-art ASR performance under both oracle and estimated speaker diarization, surpassing all CHiME-{7,8} challenge submissions. To our knowledge, it is the first neural speech separation method that substantially outperforms guided source separation on real conversational "speech-in-the-wild" data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CTRnet, a model trained directly on real-recorded close-talk/far-field mixture pairs to perform cross-talk reduction (CTR) on close-talk signals, followed by PuLSS which uses the resulting estimates as pseudo-labels to train far-field speech separation. On CHiME-6, the framework reports state-of-the-art ASR word error rates under both oracle and estimated diarization, surpassing prior CHiME-7/8 submissions and guided source separation on real conversational data.
Significance. If the central claims hold after verification of pseudo-label quality, the work would be significant for demonstrating that real-domain training via auxiliary close-talk microphones can close the simulation-to-real gap in speech separation and recognition. The two-stage design and explicit use of real pairs address a persistent practical limitation; credit is due for focusing on held-out real evaluation rather than simulated data alone.
major comments (3)
- [§3.2] §3.2 (PuLSS description): The claim that CTRnet outputs from real close-talk mixtures provide effective pseudo-labels for far-field training is load-bearing for the reported ASR gains, yet no quantitative verification (SI-SDR, PESQ, or oracle ASR delta on held-out real segments) is supplied to show the estimates are sufficiently clean rather than noisy; residual cross-talk or artifacts could explain or undermine the improvements over guided source separation.
- [§4] §4 (Experimental results): The SOTA ASR numbers on CHiME-6 under estimated diarization lack ablations isolating the contribution of CTRnet pseudo-labels versus raw close-talk signals or simulated-data baselines, and no statistical significance tests or error bars are reported to confirm the gains are robust rather than dataset-specific.
- [§2.2] §2.2 (CTRnet training): The supervision mechanism for training CTRnet on real pairs without clean targets is not fully specified; if the loss relies on far-field mixtures in a way that introduces circular dependence, the pseudo-label step risks reducing to a fitted quantity rather than providing independent supervision.
minor comments (2)
- [§3] Notation for the two-stage pipeline (CTRnet then PuLSS) should be introduced with a single diagram or equation block to avoid repeated re-definition across sections.
- [Table 1] Table 1 (baseline comparisons) would benefit from explicit column for training data type (real vs. simulated) to highlight the domain-gap advantage claimed in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we will revise the manuscript to strengthen the presentation and where we provide additional clarification.
read point-by-point responses
-
Referee: [§3.2] §3.2 (PuLSS description): The claim that CTRnet outputs from real close-talk mixtures provide effective pseudo-labels for far-field training is load-bearing for the reported ASR gains, yet no quantitative verification (SI-SDR, PESQ, or oracle ASR delta on held-out real segments) is supplied to show the estimates are sufficiently clean rather than noisy; residual cross-talk or artifacts could explain or undermine the improvements over guided source separation.
Authors: We agree that explicit quantitative verification of pseudo-label quality would strengthen the manuscript. In the revision we will add SI-SDR and PESQ results on held-out real close-talk segments (where reference signals permit) together with an oracle-ASR delta obtained by feeding CTRnet outputs directly into the recognizer. These metrics will be reported alongside the existing end-to-end ASR results on CHiME-6 to demonstrate that the pseudo-labels are sufficiently clean to drive the observed gains over guided source separation. revision: yes
-
Referee: [§4] §4 (Experimental results): The SOTA ASR numbers on CHiME-6 under estimated diarization lack ablations isolating the contribution of CTRnet pseudo-labels versus raw close-talk signals or simulated-data baselines, and no statistical significance tests or error bars are reported to confirm the gains are robust rather than dataset-specific.
Authors: We will incorporate the requested ablations in the revised manuscript: (i) PuLSS trained on raw close-talk signals without CTRnet, (ii) PuLSS trained exclusively on simulated data, and (iii) the full CTRnet + PuLSS pipeline. We will also report error bars obtained from multiple independent training runs and include paired statistical significance tests on the WER differences to establish that the improvements are robust. revision: yes
-
Referee: [§2.2] §2.2 (CTRnet training): The supervision mechanism for training CTRnet on real pairs without clean targets is not fully specified; if the loss relies on far-field mixtures in a way that introduces circular dependence, the pseudo-label step risks reducing to a fitted quantity rather than providing independent supervision.
Authors: Section 2.2 specifies that CTRnet is trained by minimizing a composite loss comprising a reconstruction term on the close-talk output and a cross-domain consistency term that aligns the estimated wearer speech with the corresponding far-field mixture after accounting for acoustic differences. The far-field signal is used only as an auxiliary reference for the shared speech content and is never employed as a direct target for the close-talk output; therefore the supervision remains independent. We will expand the loss formulation with explicit equations in the revision to remove any remaining ambiguity. revision: yes
Circularity Check
No significant circularity; training and pseudo-label steps remain independent of final metrics
full rationale
The derivation chain begins with CTRnet trained directly on real-recorded close-talk/far-field pairs to produce cross-talk-reduced estimates, which are then used as pseudo-labels to train PuLSS for far-field separation. No quoted equations, self-citations, or fitted parameters in the abstract or described framework reduce the reported CHiME-6 ASR gains by construction to the input pairs or to a renamed version of the same supervision signal. The pseudo-label quality assumption is an empirical claim subject to verification on held-out data rather than a definitional loop, and the SOTA result is presented as an observed outcome rather than a statistical necessity from the training setup itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Close-talk mixtures contain recoverable information about the wearer's speech that a neural network can isolate from cross-talk.
- domain assumption Pseudo-labels generated by CTRnet are sufficiently accurate to supervise far-field separation training.
invented entities (2)
-
CTRnet
no independent evidence
-
PuLSS
no independent evidence
Reference graph
Works this paper leans on
-
[1]
P. Comon and C. Jutten,Handbook of Blind Source Separation: Inde- pendent component analysis and applications. Academic press, 2010
work page 2010
-
[2]
J. H. McDermott, “The Cocktail Party Problem,”Current Biology, vol. 19, no. 22, pp. 1024–1027, 2009
work page 2009
-
[3]
Supervised Speech Separation Based on Deep Learning: An Overview,
D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,”IEEE/ACM Trans. Audio, Speech, Lang. Pro- cess., vol. 26, no. 10, pp. 1702–1726, 2018
work page 2018
-
[4]
30+ Years of Source Separation Research: Achievements and Future Challenges,
S. Araki, N. Ito, R. Haeb-Umbach, G. Wichern, Z.-Q. Wang, and Y . Mitsufuji, “30+ Years of Source Separation Research: Achievements and Future Challenges,” inProc. ICASSP, 2025
work page 2025
-
[5]
Far-Field Automatic Speech Recognition,
R. Haeb-Umbach, J. Heymann, L. Drude, S. Watanabe, M. Delcroix, and T. Nakatani, “Far-Field Automatic Speech Recognition,”Proc. IEEE, vol. 109, no. 2, pp. 124–148, 2021
work page 2021
-
[6]
R. Haeb-Umbach, T. Nakatani, M. Delcroix, C. Boeddeker, and T. Ochiai, “Microphone Array Signal Processing and Deep Learning for Speech Enhancement: Combining Model-Based and Data-Driven Ap- proaches to Parameter Estimation and Filtering,”IEEE Signal Process. Mag., vol. 41, no. 6, pp. 12–23, 2025
work page 2025
-
[7]
TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,
Z.-Q. Wang, S. Cornell, S. Choi, Y . Lee, B.-Y . Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 3221–3236, 2023
work page 2023
-
[8]
W. Zhang, J. Shi, C. Li, S. Watanabe, and Y . Qian, “Closing The Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions,” inProc. WASPAA, 2021, pp. 146–150
work page 2021
-
[9]
Real-M: Towards Speech Separation on Real Mixtures,
C. Subakan, M. Ravanelli, S. Cornell, and F. Grondin, “Real-M: Towards Speech Separation on Real Mixtures,” inProc. ICASSP, 2022, pp. 6862– 6866
work page 2022
-
[10]
Summary of the NOTSOFAR-1 challenge: Highlights and learnings,
I. Abramovski, A. Vinnikov, S. Shaeret al., “Summary of the NOTSOFAR-1 challenge: Highlights and learnings,”Comput. Speech Lang., vol. 93, p. 101796, 2025
work page 2025
-
[11]
S. Cornell, C. Boeddeker, T. Park, H. Huang, D. Raj, M. Wiesner, Y . Masuyama, X. Chang, Z.-Q. Wang, S. Squartini, P. Garcia, and S. Watanabe, “Recent Trends in Distant Conversational Speech Recogni- tion: A Review of CHiME-7 and 8 DASR Challenges,”Comput. Speech Lang., vol. 97, 2026
work page 2026
-
[12]
Y . Masuyama, X. Chang, W. Zhang, S. Cornell, Z.-Q. Wang, N. Ono, Y . Qian, and S. Watanabe, “An End-to-End Integration of Speech Sepa- ration and Recognition with Self-Supervised Learning Representation,” Comput. Speech Lang., vol. 95, p. 101813, 2026
work page 2026
-
[13]
The AMI Meeting Corpus: A Pre-Announcement,
J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemotet al., “The AMI Meeting Corpus: A Pre-Announcement,” inMachine Learning for Multimodal Interaction, 2006, pp. 28–39. 16
work page 2006
-
[14]
Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,
F. Yu, S. Zhang, P. Guo, Y . Fu, Z. Duet al., “Summary on The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge,” inProc. ICASSP, 2022, pp. 9156–9160
work page 2022
-
[15]
The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,
J. Barker, S. Watanabe, E. Vincentet al., “The Fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,” in Proc. Interspeech, 2018, pp. 1561–1565
work page 2018
-
[16]
Z. Wang, S. Wu, H. Chenet al., “The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition,” inProc. ICASSP, 2023, pp. 1–5
work page 2022
-
[17]
Z.-Q. Wang, A. Kumar, and S. Watanabe, “Cross-Talk Reduction,” in Proc. IJCAI, 2024, pp. 5171–5180
work page 2024
-
[18]
BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,
J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM Supported GEV Beamformer Front-End for The 3rd CHiME Challenge,” inProc. ASRU, 2015, pp. 444–451
work page 2015
-
[19]
Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,
H. Erdogan, J. R. Hershey, S. Watanabe, I. Mandel, and J. Le Roux, “Improved MVDR Beamforming using Single-Channel Mask Prediction Networks,” inProc. Interspeech, 2016, pp. 1981–1985
work page 2016
-
[20]
How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,
K. Iwamoto, T. Ochiai, M. Delcroix, R. Ikeshita, H. Sato, S. Araki, and S. Katagiri, “How Bad Are Artifacts?: Analyzing The Impact of Speech Enhancement Errors on ASR,”Proc. Interspeech, pp. 5418–5422, 2022
work page 2022
-
[21]
N. Kanda, J. Wu, X. Wang, Z. Chen, J. Li, and T. Yoshioka, “VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition,” inProc. ICASSP, 2023, pp. 1–5
work page 2023
-
[22]
Unsupervised Sound Separation using Mixture Invariant Training,
S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. R. Hershey, “Unsupervised Sound Separation using Mixture Invariant Training,”Proc. NeurIPS, vol. 33, pp. 3846–3857, 2020
work page 2020
-
[23]
Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,
J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation,” in Proc. Interspeech, 2021
work page 2021
-
[24]
Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,
K. Saijo and T. Ogawa, “Self-Remixing: Unsupervised Speech Separa- tion via Separation and Remixing,” inProc. ICASSP, 2023, pp. 1–5
work page 2023
-
[25]
Unsupervised Multi- Channel Separation and Adaptation,
C. Han, K. Wilson, S. Wisdom, and J. R. Hershey, “Unsupervised Multi- Channel Separation and Adaptation,” inProc. ICASSP, 2024, pp. 721– 725
work page 2024
-
[26]
Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,
A. Sivaraman, S. Wisdom, H. Erdogan, and J. R. Hershey, “Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training,” inProc. ICASSP, 2022, pp. 686–690
work page 2022
-
[27]
S. Wisdom, A. Jansen, R. J. Weiss, H. Erdogan, and J. R. Hershey, “Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In- the-wild Unsupervised Sound Separation,” inProc. WASPAA, 2021, pp. 51–55
work page 2021
-
[28]
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,
Z.-Q. Wang and S. Watanabe, “UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-Determined Training Mixtures,” inProc. NeurIPS, vol. 36, 2023, pp. 34 021–34 042
work page 2023
-
[29]
Enhanced Reverberation as Supervision for Unsupervised Speech Separation,
K. Saijo, G. Wichern, F. G. Germain, Z. Pan, and J. Le Roux, “Enhanced Reverberation as Supervision for Unsupervised Speech Separation,” in Proc. Interspeech, 2024, pp. 607–611
work page 2024
-
[30]
Spatial Loss for Unsupervised Multi-channel Source Separation,
K. Saijo and R. Scheibler, “Spatial Loss for Unsupervised Multi-channel Source Separation,” inProc. Interspeech, 2022, pp. 241–245
work page 2022
-
[31]
Front-End Processing for The CHiME-5 Dinner Party Scenario,
C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Hey- mann, and R. Haeb-Umbach, “Front-End Processing for The CHiME-5 Dinner Party Scenario,” inProc. CHiME, vol. 1, 2018
work page 2018
-
[32]
VarArray: Array-Geometry-Agnostic Continuous Speech Separation,
T. Yoshioka, X. Wang, D. Wang, M. Tang, Z. Zhu, Z. Chen, and N. Kanda, “VarArray: Array-Geometry-Agnostic Continuous Speech Separation,” inProc. ICASSP, 2022, pp. 6027–6031
work page 2022
-
[33]
Continuous speech separation: Dataset and analysis,
Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y . Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” inProc. ICASSP, 2020, pp. 7284–7288
work page 2020
-
[34]
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,
A. Vinnikov, A. Ivry, A. Hurvitzet al., “NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription,” in Proc. Interspeech, 2024
work page 2024
-
[35]
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,
T. V on Neumann, C. Boeddeker, T. Cord-Landwehr, M. Delcroix, and R. Haeb-Umbach, “Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization,” inProc. HSCMA, 2024, pp. 775–779
work page 2024
-
[36]
Neural fast full-rank spatial covariance analysis for blind source separation,
Y . Bando, Y . Masuyama, A. A. Nugraha, and K. Yoshii, “Neural fast full-rank spatial covariance analysis for blind source separation,” inProc. EUSIPCO, 2023, pp. 51–55
work page 2023
-
[37]
Neural Blind Source Separation and Diarization for Distant Speech Recognition,
Y . Bando, T. Nakamura, and S. Watanabe, “Neural Blind Source Separation and Diarization for Distant Speech Recognition,” inProc. Interspeech, 2024, pp. 722–726
work page 2024
-
[38]
Y . Bando, S. Cornell, S. Fukayama, and S. Watanabe, “Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition,” inProc. ICASSP, 2025, pp. 1–5
work page 2025
-
[39]
ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,
Z.-Q. Wang, “ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement,”J. Acoust. Soc. Am., vol. 158, no. 4, pp. 2849– 2862, 2025
work page 2025
-
[40]
SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,
——, “SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR,”Neural Networks, vol. 188, no. 107408, pp. 1–16, 2025
work page 2025
-
[41]
Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,
——, “Mixture to Mixture: Leveraging Close-Talk Mixtures as Weak- Supervision for Speech Separation,”IEEE Signal Process. Lett., vol. 31, pp. 1715–1719, 2024
work page 2024
-
[42]
L. Luo, L. Li, and Q. Hong, “SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recogni- tion,” inProc. Interspeech, 2025, pp. 3404–3408
work page 2025
-
[43]
Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,
L. Luo, S. Lu, L. Li, and Q. Hong, “Pseudo Labels-Based Neural Speech Enhancement for the A VSR Task in The MISP-Meeting Challenge,” in Proc. Interspeech, 2025, pp. 1883–1887
work page 2025
-
[44]
Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,
R. Talmon, I. Cohen, and S. Gannot, “Relative Transfer Function Iden- tification using Convolutive Transfer Function Approximation,”IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 546–555, 2009
work page 2009
-
[45]
A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,
S. Gannot, E. Vincentet al., “A Consolidated Perspective on Multi- Microphone Speech Enhancement and Source Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 692–730, 2017
work page 2017
-
[46]
Understanding Blind Deconvolution Algorithms,
A. Levin, Y . Weiss, F. Durand, and W. T. Freeman, “Understanding Blind Deconvolution Algorithms,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2354–2367, 2011
work page 2011
-
[47]
Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,
Z.-Q. Wang, G. Wichern, and J. Le Roux, “Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3476–3490, 2021
work page 2021
-
[48]
Differentiable Consistency Constraints for Improved Deep Speech Enhancement,
S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable Consistency Constraints for Improved Deep Speech Enhancement,” inProc. ICASSP, 2019, pp. 900–904
work page 2019
-
[49]
M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017
work page 1901
-
[50]
D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y . Luoet al., “Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System De- scription, Comparison, and Analysis,” inProc. SLT, 2021, pp. 897–904
work page 2021
-
[51]
CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,
S. Watanabe, M. Mandel, J. Barkeret al., “CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,” in Proc. CHiME, 2020, pp. 1–7
work page 2020
-
[52]
S. Cornell, M. S. Wiesner, S. Watanabe, D. Raj, X. Chang, P. Garcia, Y . Masuyam, Z.-Q. Wang, S. Squartini, and S. Khudanpur, “The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multi- ple Devices in Diverse Scenarios,” inProc. CHiME, 2023, pp. 1–6
work page 2023
-
[53]
S. Cornell, T. J. Park, H. Huang, C. Boeddeker, X. Chang, M. Maciejew- ski, M. S. Wiesner, P. Garcia, and S. Watanabe, “The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization,” inProc. CHiME, 2024, pp. 1– 6
work page 2024
-
[54]
The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,
R. Wang, M. He, J. Duet al., “The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge,” inProc. CHiME, 2023, pp. 13–18
work page 2023
-
[55]
The IACAS- Thinkit System for CHiME-7 Challenge,
L. Ye, H. Lu, G. Cheng, Y . Chen, Z. Shang, and X. Li, “The IACAS- Thinkit System for CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 23–26
work page 2023
-
[56]
Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,
Z.-Q. Wang, P. Wang, and D. Wang, “Multi-Microphone Complex Spec- tral Mapping for Utterance-Wise and Continuous Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2001– 2014, 2021
work page 2001
-
[57]
Librispeech: An ASR Corpus Based on Public Domain Audio Books,
V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,”Proc. ICASSP, pp. 5206–5210, 2015
work page 2015
-
[58]
EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,
J. Richter, Y . C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, “EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation,” inProc. Interspeech, 2024, pp. 4873–4877
work page 2024
-
[59]
FSD50K: An Open Dataset of Human-Labeled Sound Events,
E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022
work page 2022
-
[60]
K. Kinoshita, M. Delcroix, S. Gannotet al., “A Summary of The REVERB Challenge: State-of-The-Art and Remaining Challenges in Re- verberant Speech Processing Research,”Eurasip J. Adv. Signal Process., vol. 2016, no. 1, pp. 1–19, 2016
work page 2016
-
[61]
Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,
T. von Neumann, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “Word Error Rate Definitions and Algorithms for Long-Form Multi- Talker Speech Recognition,”IEEE Trans. Audio, Speech, Lang. Process., 2025. 17
work page 2025
-
[62]
arXiv preprint arXiv:2509.14128 , year =
M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1B-v2 & Parakeet-TDT- 0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST,”arXiv preprint arXiv:2509.14128, 2025
-
[63]
arXiv preprint arXiv:1909.09577 , year=
O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V . Lavrukhin, J. Cooket al., “NeMo: A Toolkit for Building AI Applications using Neural Modules,”arXiv preprint arXiv:1909.09577, 2019
-
[64]
BUT CHiME-7 System Description,
M. Karafiat, K. Vesel ´y, I. Szoke, L. Mosner, K. Benes, M. Witkowski, R. G. Barchi, and L. D. Pepino, “BUT CHiME-7 System Description,” inProc. CHiME, 2023, pp. 67–72
work page 2023
-
[65]
The NPU System for DASR Task of CHiME-7 Challenge,
B. Mu, P. Guo, H. Wang, Y . Li, Y . Li, P. Zhou, W. Chen, and L. Xie, “The NPU System for DASR Task of CHiME-7 Challenge,” inProc. CHiME, 2023, pp. 63–66
work page 2023
-
[66]
The University of Cambridge System for the CHiME-7 DASR Task,
K. Deng, X. Zheng, and P. Woodland, “The University of Cambridge System for the CHiME-7 DASR Task,” inProc. CHiME, 2023, pp. 73– 76
work page 2023
-
[67]
NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,
N. Kamo, N. Tawara, A. Andoet al., “NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 69–74
work page 2024
-
[68]
STCON System for the CHiME-8 Challenge,
A. Mitrofanov, T. Prisyach, T. Timofeevaet al., “STCON System for the CHiME-8 Challenge,” inProc. CHiME, 2024, pp. 13–17
work page 2024
-
[69]
The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,
N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines,” inProc. Interspeech, 2019, pp. 978–982
work page 2019
-
[70]
pyannote.audio: Neural Building Blocks for Speaker Diarization,
H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeuxet al., “pyannote.audio: Neural Building Blocks for Speaker Diarization,” inProc. ICASSP, 2020, pp. 7124–7128
work page 2020
-
[71]
Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,
K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating End-to-End Neural and Clustering-Based Diarization: Getting The Best of Both Worlds,” inProc. ICASSP, 2021, pp. 7198–7202
work page 2021
-
[72]
Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,
Y . Lee, S. Choi, B. Y . Kim, Z. Q. Wang, and S. Watanabe, “Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor,” inProc. ICASSP, 2024, pp. 446–450. Zhong-Qiu Wangreceived the B.E. degree in com- puter science and technology from Harbin Institute of Technology, Harbin, China, in2013, and the Ph.D. degree in computer science...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.