Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming

Masahito Togami; Tatsuya Komatsu; Yoshiki Masuyama

arxiv: 1907.04984 · v1 · pith:ZV7SL6OQnew · submitted 2019-07-11 · 💻 cs.SD · eess.AS

Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming

Yoshiki Masuyama , Masahito Togami , Tatsuya Komatsu This is my paper

Pith reviewed 2026-05-24 23:09 UTC · model grok-4.3

classification 💻 cs.SD eess.AS

keywords mask-based beamformingmultichannel lossItakura-Saito divergencespatial covariance matrixtime-frequency maskDNN speech separationmicrophone array

0 comments

The pith

DNNs trained with multichannel Itakura-Saito divergence on spatial covariance matrices yield better mask-based beamformers than those trained with monaural losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces multichannel loss functions to train DNNs that estimate time-frequency masks for beamforming in speech source separation. Standard monaural losses do not align with the final beamforming step that relies on spatial covariance matrices derived from those masks. The new losses apply the multichannel Itakura-Saito divergence directly to the estimated covariance matrices, closing the training-application gap. The resulting DNNs support multiple beamformer constructions and remain effective across different microphone arrays.

Core claim

DNNs trained by multichannel loss functions that evaluate estimated spatial covariance matrices via the multichannel Itakura-Saito divergence produce time-frequency masks that enable effective mask-based beamforming for supervised speech source separation, with demonstrated robustness to microphone configuration changes.

What carries the argument

Multichannel Itakura-Saito divergence applied to spatial covariance matrices estimated from DNN-generated time-frequency masks.

If this is right

The same trained DNN can be reused to build several different beamformers without retraining.
Training directly optimizes quantities used in the beamformer, reducing mismatch between criterion and application.
Performance holds across varying numbers and placements of microphones.
Mask estimation quality improves when the loss respects multichannel statistics rather than single-channel spectra.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same divergence-based training could be tested on other array-processing tasks that depend on covariance estimates, such as source localization.
If the covariance matrices are the dominant factor, simpler mask estimators might suffice once the loss is multichannel.
The approach suggests a template for designing losses that operate on intermediate statistics rather than final waveforms in other audio ML pipelines.

Load-bearing premise

Evaluating spatial covariance matrices with the multichannel Itakura-Saito divergence during training produces masks whose beamforming performance matches the loss value.

What would settle it

A controlled test in which DNNs achieve lower multichannel Itakura-Saito divergence yet yield no improvement (or a drop) in beamformed separation metrics such as SI-SDR compared with monaural-loss baselines.

Figures

Figures reproduced from arXiv: 1907.04984 by Masahito Togami, Tatsuya Komatsu, Yoshiki Masuyama.

**Figure 1.** Figure 1: Block diagram of mask-based beamforming. The proposed methods use multichannel loss functions which evaluate spatial covariance matrices (red) while conventional methods use monaural loss functions (blue). does not depend on microphone configurations, and the effectiveness of the mask-based beamforming has been shown in noise-robust ASR [14, 15]. While mask-based beamforming for speech enhancement and no… view at source ↗

**Figure 2.** Figure 2: Network architecture used in experiment. Mask-based beamformers were calculated from TF-masks Mt,f,n. Timevarying activation vt,f,n was used in only Prop. 1 and [28] for calculating time-varying spatial covariance matrices. activation. In order to solve this problem, we can use PIT [23]. That is, the permutation problem is solved so that the loss function takes small value. 4. Experiment In order to conf… view at source ↗

read the original abstract

In this paper, we propose two mask-based beamforming methods using a deep neural network (DNN) trained by multichannel loss functions. Beamforming technique using time-frequency (TF)-masks estimated by a DNN have been applied to many applications where TF-masks are used for estimating spatial covariance matrices. To train a DNN for mask-based beamforming, loss functions designed for monaural speech enhancement/separation have been employed. Although such a training criterion is simple, it does not directly correspond to the performance of mask-based beamforming. To overcome this problem, we use multichannel loss functions which evaluate the estimated spatial covariance matrices based on the multichannel Itakura--Saito divergence. DNNs trained by the multichannel loss functions can be applied to construct several beamformers. Experimental results confirmed their effectiveness and robustness to microphone configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces multichannel Itakura-Saito losses on spatial covariance matrices to train mask estimators for beamforming, directly targeting the monaural loss mismatch.

read the letter

The core move is to replace monaural enhancement losses with two multichannel versions based on Itakura-Saito divergence between estimated and target spatial covariance matrices. This trains the DNN mask estimator so its output better supports the downstream beamformer instead of optimizing an unrelated single-channel metric. The same trained masks can then be plugged into several standard beamformers, which is a practical plus. They also report robustness across different microphone counts and geometries, which matters for real deployments. That addresses a genuine training-objective gap in mask-based beamforming pipelines. The idea is clean and the math follows from standard covariance estimation steps without obvious circularity or hidden fitting. The abstract states that experiments confirmed effectiveness, so the paper's value hinges on whether the full results show clear gains over monaural baselines with proper controls and error bars. If those tables hold up, the contribution is useful within its niche. If the improvements are small or sensitive to implementation details, the practical impact shrinks. This work is aimed at researchers and engineers building multi-microphone speech separation systems. Anyone already using DNN masks for covariance estimation will see the training change immediately. It is worth sending to peer review because the central fix is well-motivated and the claims are falsifiable with standard audio metrics.

Referee Report

0 major / 1 minor

Summary. The paper proposes two mask-based beamforming methods for supervised speech source separation in which a DNN is trained using multichannel loss functions based on the Itakura-Saito divergence applied to estimated spatial covariance matrices. This training criterion is intended to directly address the mismatch between conventional monaural losses and the final beamforming performance. The resulting DNN masks are used to construct several beamformers, and the abstract states that experiments confirm effectiveness together with robustness to microphone configurations.

Significance. If the experimental claims hold, the work supplies a training objective that is more closely aligned with the downstream beamforming task than monaural losses, which could improve separation quality in multichannel settings and reduce sensitivity to array geometry.

minor comments (1)

Abstract: the claim of experimental confirmation is stated without any quantitative results, baselines, error bars, or statistical analysis, which prevents verification of the central effectiveness and robustness assertions from the provided text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their time and for providing a concise summary of our work on multichannel Itakura-Saito divergence losses for mask-based beamforming. The report does not enumerate any specific major comments or criticisms. We therefore have no individual points to rebut or revise at this stage. If the editor or referee can supply the detailed major comments that were intended, we will respond to them promptly and in full.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a multichannel Itakura-Saito loss on estimated spatial covariance matrices as the training objective for the DNN mask estimator, then applies the resulting masks to construct beamformers and evaluates separation quality on held-out data. This objective is not algebraically equivalent to the final beamforming metrics by the paper's own equations, nor is any prediction shown to be a direct renaming or refit of the training loss. No self-citation chain is load-bearing for the central claim, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The Itakura-Saito divergence is treated as a standard tool.

pith-pipeline@v0.9.0 · 5676 in / 951 out tokens · 14475 ms · 2026-05-24T23:09:12.546533+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we use multichannel loss functions which evaluate the estimated spatial covariance matrices based on the multichannel Itakura–Saito divergence
IndisputableMonolith/Cost.lean Jcost_pos_of_ne_one matches

?

matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

L2 = ∑ tr(Xt,f X̂t,f⁻¹) + log det(X̂t,f)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1]

Introduction Speech source separation is a fundamental technique with many applications including automatic speech recognition (ASR) [1, 2] and hearing aid [3]. Although speech source separation with a single microphone is applicable [4], that with multiple micro- phones is more effective because it can take advantage of spatial information [5]. There exi...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

Preriminary 2.1. Mask-based beamforming Let N source signals be observed by M microphones, xt,f,m be the observed mixture, and ct,f,n,m be the nth source sig- nal observed at the mth microphone where t = 1 , . . . , T and f = 1 , . . . , F are time and frequency indices, respectively. A separated source ˆct,f,n obtained by beamforming is given as ˆct,f,n ...

work page
[3]

Proposed mask-based beamforming with multichannel loss function In this paper, we propose two mask-based beamforming meth- ods using DNNs trained by multichannel loss functions which evaluate the estimated spatial covariance matrices as illustrated in Fig. 1. After reviewing a multichannel loss function for time- varying MWF [28], the proposed time-invari...

work page
[4]

(3) [10] and by multi- channel loss functions were compared in speaker-independent multi-talker separation by the mask-based beamforming

Experiment In order to conﬁrm the effectiveness of the multichannel loss functions, DNNs trained by PSA in Eq. (3) [10] and by multi- channel loss functions were compared in speaker-independent multi-talker separation by the mask-based beamforming. Based on the spatial covariance matrices estimated by TF-masking, three beamformers (MVDR beamformer in Eq. ...

work page
[5]

The training and 3 testing conditions are summarized in Table 1

and the clean speech in TIMIT corpus [34] were used for making multichannel signals. The training and 3 testing conditions are summarized in Table 1. The number of micro- phones and sources were set to 2 where 2 microphones were randomly selected for each sample from the microphone ar- rangement shown in Table 1. For training,20000 speeches were selected,...

work page
[6]

Two multichannel loss functions, used in the proposed methods, evaluate the spatial covariance matrices based on two types of MISD

Conclusion In this paper, we proposed two mask-based beamforming meth- ods using DNNs trained by multichannel loss functions. Two multichannel loss functions, used in the proposed methods, evaluate the spatial covariance matrices based on two types of MISD. The experimental results indicate that the mask- based beamforming with the multichannel loss funct...

work page
[7]

Kohei Yatabe for his valu- able comments and discussion

Acknowledgements The authors would like to thank Dr. Kohei Yatabe for his valu- able comments and discussion

work page
[8]

Acoustic mod- eling for google home,

B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Wein- traub, E. McDermott, R. Rose, and M. Shannon, “Acoustic mod- eling for google home,” in Proc. Interspeech, Aug. 2017, pp. 399– 403

work page 2017
[9]

Watanabe, M

S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, Eds., New Era for Robust Speech Recognition: Exploiting Deep Learning . Springer, 2017

work page 2017
[10]

Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with trun- cation of non-causal components,

M. Sunohara, C. Haruta, and N. Ono, “Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with trun- cation of non-causal components,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 216–220

work page 2017
[11]

Supervised speech separation based on deep learning: An overview,

D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018

work page 2018
[12]

A consolidated perspective on multimicrophone speech enhance- ment and source separation,

S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhance- ment and source separation,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 25, no. 4, pp. 692–730, Apr. 2017

work page 2017
[13]

Blind separation of convolved mixtures in the frequency domain,

P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Neurocomputing, vol. 22, no. 1, pp. 21–34, 1998

work page 1998
[14]

Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,

H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K. Shikano, “Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 666–678, Mar. 2006

work page 2006
[15]

Determined blind source separa- tion via proximal splitting algorithm,

K. Yatabe and D. Kitamura, “Determined blind source separa- tion via proximal splitting algorithm,” in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 776–780

work page 2018
[16]

Under- determined reverberant audio source separation using a full-rank spatial covariance model,

N. Q. K. Duong, E. Vincent, and R. Gribonval, “Under- determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans Audio, Speech, Lang. Pro- cess., vol. 18, no. 7, pp. 1830–1840, Sep. 2010

work page 2010
[17]

Multi-talker speech separation based on permutation invariant training and beamform- ing,

L. Yin, Z. Wang, R. Xia, J. Li, and Y . Yan, “Multi-talker speech separation based on permutation invariant training and beamform- ing,” in Interspeech, Sep. 2018, pp. 851–855

work page 2018
[18]

Multi- microphone neural speech separation for far-ﬁeld multi-talker speech recognition,

T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi- microphone neural speech separation for far-ﬁeld multi-talker speech recognition,” in IEEE Int. Conf. on Acoust., Speech Sig- nal Process. (ICASSP), Apr. 2018, pp. 5739–5743

work page 2018
[19]

Combining spectral and spatial features for deep learning based blind speaker separation,

Z. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 2, pp. 457–468, Feb. 2019

work page 2019
[20]

Tight integration of spatial and spectral features for BSS with deep clustering embeddings,

L. Drude and R. Haeb-Umbach, “Tight integration of spatial and spectral features for BSS with deep clustering embeddings,” in Interspeech, Aug. 2017, pp. 2650–2654

work page 2017
[21]

Neural network based spectral mask estimation for acoustic beamforming,

J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP) , Mar. 2016, pp. 196–200

work page 2016
[22]

Beamnet: End-to-end training of a beamformer- supported multi-channel ASR system,

J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb- Umbach, “Beamnet: End-to-end training of a beamformer- supported multi-channel ASR system,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2017

work page 2017
[23]

Uni- ﬁed architecture for multichannel end-to-end speech recognition with neural beamforming,

T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, “Uni- ﬁed architecture for multichannel end-to-end speech recognition with neural beamforming,” IEEE J. Selected Topics Signal Pro- cess., vol. 11, no. 8, pp. 1274–1288, Dec. 2017

work page 2017
[24]

Neural network adaptive beamforming for robust multichannel speech recognition,

B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, “Neural network adaptive beamforming for robust multichannel speech recognition,” in Interspeech, 2016, pp. 1976–1980

work page 2016
[25]

Deep beam- forming networks for multi-channel speech recognition,

X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y . Zhang, M. Mandel, and D. Yu, “Deep beam- forming networks for multi-channel speech recognition,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2016, pp. 5745–5749

work page 2016
[26]

On optimal frequency- domain multichannel linear ﬁltering for noise reduction,

M. Souden, J. Benesty, and S. Affes, “On optimal frequency- domain multichannel linear ﬁltering for noise reduction,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 260–276, Feb. 2010

work page 2010
[27]

Blind acoustic beamforming based on generalized eigenvalue decomposition,

E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,”IEEE Trans. Au- dio, Speech, Language Process , vol. 15, no. 5, pp. 1529–1539, 2007

work page 2007
[28]

GSVD-based optimal ﬁltering for sin- gle and multimicrophone speech enhancement,

S. Doclo and M. Moonen, “GSVD-based optimal ﬁltering for sin- gle and multimicrophone speech enhancement,” IEEE Trans. Sig- nal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002

work page 2002
[29]

Permutation invari- ant training of deep models for speaker-independent multi-talker speech separation,

D. Yu, M. Kolbk, Z. Tan, and J. Jensen, “Permutation invari- ant training of deep models for speaker-independent multi-talker speech separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 241–245

work page 2017
[30]

Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,

M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,”IEEE/ACM Trans. Audio, Speech Lang. Proc., vol. 25, no. 10, pp. 1901–1913, Oct. 2017

work page 1901
[31]

Deep clustering: Discriminative embeddings for segmentation and sep- aration,

J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and sep- aration,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 31–35

work page 2016
[32]

Deep attractor network for single-microphone speaker separation,

Z. Chen, Y . Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP) , Mar. 2017, pp. 246– 250

work page 2017
[33]

Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,

Z. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1–5

work page 2018
[34]

Phase- sensitive and recognition-boosted speech separation using deep recurrent neural networks,

H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase- sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 708–712

work page 2015
[35]

Multi-channel itakura saito distance minimization with deep neural network,

M. Togami, “Multi-channel itakura saito distance minimization with deep neural network,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), May 2019

work page 2019
[36]

Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming,

T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani, “Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming,” in IEEE Int. Conf. on Acoust., Speech Sig- nal Process. (ICASSP), 2018, pp. 531–535

work page 2018
[37]

Blind speech separation in a meeting situation with maximum SNR beamformers,

S. Araki, H. Sawada, and S. Makino, “Blind speech separation in a meeting situation with maximum SNR beamformers,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP) , vol. 1, Apr. 2007, pp. 41–44

work page 2007
[38]

Robust ASR using neural network based speech enhancement and feature sim- ulation,

S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales- Cordovilla, S. Dalmia, I. Illina, and A. Liutkus, “Robust ASR using neural network based speech enhancement and feature sim- ulation,” in IEEE Workshop Autom. Speech Recognit. Underst. (ASRU), Dec. 2015, pp. 482–489

work page 2015
[39]

Multichan- nel extensions of non-negative matrix factorization with complex- valued data,

H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichan- nel extensions of non-negative matrix factorization with complex- valued data,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 971–982, May 2013

work page 2013
[40]

Multichannel au- dio database in various acoustic environments,

E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel au- dio database in various acoustic environments,” in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sep. 2014, pp. 31–317

work page 2014
[41]

DARPA TIMIT acoustic-phonetic continous speech cor- pus CD-ROM,

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continous speech cor- pus CD-ROM,” 1993

work page 1993
[42]

Performance mea- surement in blind audio source separation,

E. Vincent, R. Gribonval, and C. F ´evotte, “Performance mea- surement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006

work page 2006

[1] [1]

Introduction Speech source separation is a fundamental technique with many applications including automatic speech recognition (ASR) [1, 2] and hearing aid [3]. Although speech source separation with a single microphone is applicable [4], that with multiple micro- phones is more effective because it can take advantage of spatial information [5]. There exi...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[2] [2]

Preriminary 2.1. Mask-based beamforming Let N source signals be observed by M microphones, xt,f,m be the observed mixture, and ct,f,n,m be the nth source sig- nal observed at the mth microphone where t = 1 , . . . , T and f = 1 , . . . , F are time and frequency indices, respectively. A separated source ˆct,f,n obtained by beamforming is given as ˆct,f,n ...

work page

[3] [3]

Proposed mask-based beamforming with multichannel loss function In this paper, we propose two mask-based beamforming meth- ods using DNNs trained by multichannel loss functions which evaluate the estimated spatial covariance matrices as illustrated in Fig. 1. After reviewing a multichannel loss function for time- varying MWF [28], the proposed time-invari...

work page

[4] [4]

(3) [10] and by multi- channel loss functions were compared in speaker-independent multi-talker separation by the mask-based beamforming

Experiment In order to conﬁrm the effectiveness of the multichannel loss functions, DNNs trained by PSA in Eq. (3) [10] and by multi- channel loss functions were compared in speaker-independent multi-talker separation by the mask-based beamforming. Based on the spatial covariance matrices estimated by TF-masking, three beamformers (MVDR beamformer in Eq. ...

work page

[5] [5]

The training and 3 testing conditions are summarized in Table 1

and the clean speech in TIMIT corpus [34] were used for making multichannel signals. The training and 3 testing conditions are summarized in Table 1. The number of micro- phones and sources were set to 2 where 2 microphones were randomly selected for each sample from the microphone ar- rangement shown in Table 1. For training,20000 speeches were selected,...

work page

[6] [6]

Two multichannel loss functions, used in the proposed methods, evaluate the spatial covariance matrices based on two types of MISD

Conclusion In this paper, we proposed two mask-based beamforming meth- ods using DNNs trained by multichannel loss functions. Two multichannel loss functions, used in the proposed methods, evaluate the spatial covariance matrices based on two types of MISD. The experimental results indicate that the mask- based beamforming with the multichannel loss funct...

work page

[7] [7]

Kohei Yatabe for his valu- able comments and discussion

Acknowledgements The authors would like to thank Dr. Kohei Yatabe for his valu- able comments and discussion

work page

[8] [8]

Acoustic mod- eling for google home,

B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Wein- traub, E. McDermott, R. Rose, and M. Shannon, “Acoustic mod- eling for google home,” in Proc. Interspeech, Aug. 2017, pp. 399– 403

work page 2017

[9] [9]

Watanabe, M

S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, Eds., New Era for Robust Speech Recognition: Exploiting Deep Learning . Springer, 2017

work page 2017

[10] [10]

Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with trun- cation of non-causal components,

M. Sunohara, C. Haruta, and N. Ono, “Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with trun- cation of non-causal components,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 216–220

work page 2017

[11] [11]

Supervised speech separation based on deep learning: An overview,

D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018

work page 2018

[12] [12]

A consolidated perspective on multimicrophone speech enhance- ment and source separation,

S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhance- ment and source separation,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 25, no. 4, pp. 692–730, Apr. 2017

work page 2017

[13] [13]

Blind separation of convolved mixtures in the frequency domain,

P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Neurocomputing, vol. 22, no. 1, pp. 21–34, 1998

work page 1998

[14] [14]

Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,

H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K. Shikano, “Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 666–678, Mar. 2006

work page 2006

[15] [15]

Determined blind source separa- tion via proximal splitting algorithm,

K. Yatabe and D. Kitamura, “Determined blind source separa- tion via proximal splitting algorithm,” in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 776–780

work page 2018

[16] [16]

Under- determined reverberant audio source separation using a full-rank spatial covariance model,

N. Q. K. Duong, E. Vincent, and R. Gribonval, “Under- determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans Audio, Speech, Lang. Pro- cess., vol. 18, no. 7, pp. 1830–1840, Sep. 2010

work page 2010

[17] [17]

Multi-talker speech separation based on permutation invariant training and beamform- ing,

L. Yin, Z. Wang, R. Xia, J. Li, and Y . Yan, “Multi-talker speech separation based on permutation invariant training and beamform- ing,” in Interspeech, Sep. 2018, pp. 851–855

work page 2018

[18] [18]

Multi- microphone neural speech separation for far-ﬁeld multi-talker speech recognition,

T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi- microphone neural speech separation for far-ﬁeld multi-talker speech recognition,” in IEEE Int. Conf. on Acoust., Speech Sig- nal Process. (ICASSP), Apr. 2018, pp. 5739–5743

work page 2018

[19] [19]

Combining spectral and spatial features for deep learning based blind speaker separation,

Z. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 2, pp. 457–468, Feb. 2019

work page 2019

[20] [20]

Tight integration of spatial and spectral features for BSS with deep clustering embeddings,

L. Drude and R. Haeb-Umbach, “Tight integration of spatial and spectral features for BSS with deep clustering embeddings,” in Interspeech, Aug. 2017, pp. 2650–2654

work page 2017

[21] [21]

Neural network based spectral mask estimation for acoustic beamforming,

J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP) , Mar. 2016, pp. 196–200

work page 2016

[22] [22]

Beamnet: End-to-end training of a beamformer- supported multi-channel ASR system,

J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb- Umbach, “Beamnet: End-to-end training of a beamformer- supported multi-channel ASR system,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2017

work page 2017

[23] [23]

Uni- ﬁed architecture for multichannel end-to-end speech recognition with neural beamforming,

T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, “Uni- ﬁed architecture for multichannel end-to-end speech recognition with neural beamforming,” IEEE J. Selected Topics Signal Pro- cess., vol. 11, no. 8, pp. 1274–1288, Dec. 2017

work page 2017

[24] [24]

Neural network adaptive beamforming for robust multichannel speech recognition,

B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, “Neural network adaptive beamforming for robust multichannel speech recognition,” in Interspeech, 2016, pp. 1976–1980

work page 2016

[25] [25]

Deep beam- forming networks for multi-channel speech recognition,

X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y . Zhang, M. Mandel, and D. Yu, “Deep beam- forming networks for multi-channel speech recognition,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2016, pp. 5745–5749

work page 2016

[26] [26]

On optimal frequency- domain multichannel linear ﬁltering for noise reduction,

M. Souden, J. Benesty, and S. Affes, “On optimal frequency- domain multichannel linear ﬁltering for noise reduction,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 260–276, Feb. 2010

work page 2010

[27] [27]

Blind acoustic beamforming based on generalized eigenvalue decomposition,

E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,”IEEE Trans. Au- dio, Speech, Language Process , vol. 15, no. 5, pp. 1529–1539, 2007

work page 2007

[28] [28]

GSVD-based optimal ﬁltering for sin- gle and multimicrophone speech enhancement,

S. Doclo and M. Moonen, “GSVD-based optimal ﬁltering for sin- gle and multimicrophone speech enhancement,” IEEE Trans. Sig- nal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002

work page 2002

[29] [29]

Permutation invari- ant training of deep models for speaker-independent multi-talker speech separation,

D. Yu, M. Kolbk, Z. Tan, and J. Jensen, “Permutation invari- ant training of deep models for speaker-independent multi-talker speech separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 241–245

work page 2017

[30] [30]

Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,

M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,”IEEE/ACM Trans. Audio, Speech Lang. Proc., vol. 25, no. 10, pp. 1901–1913, Oct. 2017

work page 1901

[31] [31]

Deep clustering: Discriminative embeddings for segmentation and sep- aration,

J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and sep- aration,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 31–35

work page 2016

[32] [32]

Deep attractor network for single-microphone speaker separation,

Z. Chen, Y . Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP) , Mar. 2017, pp. 246– 250

work page 2017

[33] [33]

Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,

Z. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1–5

work page 2018

[34] [34]

Phase- sensitive and recognition-boosted speech separation using deep recurrent neural networks,

H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase- sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 708–712

work page 2015

[35] [35]

Multi-channel itakura saito distance minimization with deep neural network,

M. Togami, “Multi-channel itakura saito distance minimization with deep neural network,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), May 2019

work page 2019

[36] [36]

Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming,

T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani, “Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming,” in IEEE Int. Conf. on Acoust., Speech Sig- nal Process. (ICASSP), 2018, pp. 531–535

work page 2018

[37] [37]

Blind speech separation in a meeting situation with maximum SNR beamformers,

S. Araki, H. Sawada, and S. Makino, “Blind speech separation in a meeting situation with maximum SNR beamformers,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP) , vol. 1, Apr. 2007, pp. 41–44

work page 2007

[38] [38]

Robust ASR using neural network based speech enhancement and feature sim- ulation,

S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales- Cordovilla, S. Dalmia, I. Illina, and A. Liutkus, “Robust ASR using neural network based speech enhancement and feature sim- ulation,” in IEEE Workshop Autom. Speech Recognit. Underst. (ASRU), Dec. 2015, pp. 482–489

work page 2015

[39] [39]

Multichan- nel extensions of non-negative matrix factorization with complex- valued data,

H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichan- nel extensions of non-negative matrix factorization with complex- valued data,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 971–982, May 2013

work page 2013

[40] [40]

Multichannel au- dio database in various acoustic environments,

E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel au- dio database in various acoustic environments,” in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sep. 2014, pp. 31–317

work page 2014

[41] [41]

DARPA TIMIT acoustic-phonetic continous speech cor- pus CD-ROM,

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continous speech cor- pus CD-ROM,” 1993

work page 1993

[42] [42]

Performance mea- surement in blind audio source separation,

E. Vincent, R. Gribonval, and C. F ´evotte, “Performance mea- surement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006

work page 2006