Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

Kenji Nagamatsu; Naoyuki Kanda; Ryoichi Takashima; Shinji Watanabe; Shota Horiguchi; Yusuke Fujita

REVIEW 2 major objections 1 minor 49 references

An auxiliary loss that also recognizes interference speakers improves target-speaker ASR from mixed audio.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-25 16:09 UTC pith:J3W5P7EA

Load-bearing objection: The auxiliary loss gives a modest 6.6% relative WER drop on two-speaker ASR but the abstract supplies no evidence that the claimed regularization mechanism is what drives it.

arXiv 1906.10876 v1 · pith:J3W5P7EA · submitted 2019-06-26 · cs.CL cs.SD eess.AS

Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, Shinji Watanabe

classification cs.CL cs.SD eess.AS

keywords target-speaker ASR · auxiliary loss · speaker separation · mixed speech · interference speakers · word error rate · monaural mixture

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an auxiliary loss for target-speaker automatic speech recognition that processes a monaural mixture of speakers given a short target-speaker sample. The loss adds a term that maximizes transcription accuracy on the non-target speakers during training. The goal is to produce internal representations that better separate the target speaker from the mixture. On two-speaker test mixtures the method lowers word error rate by 6.6 percent relative to a strong lattice-free maximum mutual information baseline. The same auxiliary output branch can also transcribe the interference speakers as a secondary task.

Core claim

The auxiliary interference speaker loss, by additionally maximizing ASR accuracy on interference speakers, regularizes the network to achieve better speaker separation representations and thereby higher accuracy on the target-speaker ASR task.

What carries the argument

Auxiliary interference speaker loss added to the training objective.
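A minimal sketch of how such a term could enter the objective, assuming the two branch losses are combined as a weighted sum. Frame-level cross-entropy stands in for LF-MMI here (the paper's own text notes the idea extends to other ASR losses such as cross entropy), and `cross_entropy`, `combined_loss`, and `aux_weight` are hypothetical names; the abstract does not report the weighting actually used.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the correct class for one frame."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


def combined_loss(target_logits, target_label,
                  interf_logits, interf_label, aux_weight=1.0):
    """Main (target-speaker) loss plus a weighted auxiliary (interference) loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    main = cross_entropy(target_logits, target_label)
    aux = cross_entropy(interf_logits, interf_label)
    return main + aux_weight * aux, main, aux


# Toy single-frame example: each branch emits logits over 3 classes.
total, main, aux = combined_loss(
    [2.0, 0.5, -1.0], 0, [0.1, 1.5, 0.3], 1, aux_weight=0.5
)
print(f"main={main:.3f} aux={aux:.3f} total={total:.3f}")
```

Setting `aux_weight=0.0` recovers target-only training, which is one way to frame the baseline comparison.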

Load-bearing premise

Training the model to recognize interference speakers will improve the internal separation of the target speaker rather than introduce conflicting gradients.

What would settle it

Train the model with the auxiliary loss and measure whether target-speaker word error rate on held-out mixtures fails to drop below the 18.06 percent baseline.


If this is right

Word error rate falls from 18.06 percent to 16.87 percent on the two-speaker test set.
The auxiliary output branch can be used directly as a secondary ASR system for interference speakers.
The improvement holds across a range of signal-to-interference ratio conditions.
The approach starts from a lattice-free maximum mutual information baseline that already outperforms a normal clean-speech ASR by a large margin.
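The headline figure can be checked directly from the two WER values quoted above. The helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(baseline_wer, proposed_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


baseline_wer = 18.06  # target-speaker LF-MMI baseline, from the abstract
proposed_wer = 16.87  # with the auxiliary interference speaker loss

absolute = baseline_wer - proposed_wer
relative = relative_reduction(baseline_wer, proposed_wer)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1f}%")
# absolute: 1.19 points, relative: 6.6%
```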

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same auxiliary-loss idea could be applied to mixtures containing more than two speakers by adding one branch per additional speaker.
The technique may reduce reliance on separate speaker-separation front-ends in fully end-to-end multi-speaker systems.
Gains could compound if the underlying single-speaker ASR model continues to improve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The auxiliary loss gives a modest 6.6% relative WER drop on two-speaker ASR but the abstract supplies no evidence that the claimed regularization mechanism is what drives it.


The paper adds an auxiliary branch that trains the model to also recognize the interfering speaker, on the theory that this regularizes the shared layers toward better target-speaker separation. On two-speaker mixtures the authors move from 18.06% WER with a strong LF-MMI baseline to 16.87%, and they note that the auxiliary branch can double as an interference-speaker recognizer. That is the concrete new piece: the specific auxiliary objective is not described in the prior work they cite, and the side use of the branch is a practical plus. The baseline itself is competitive, which makes the relative gain worth noticing for anyone doing multi-speaker ASR.

The soft spots are exactly where the stress-test note flags them. The abstract gives no loss weighting, no training schedule details, no auxiliary-task WER curves, and no check on whether the two objectives produce aligned or opposing gradients on the encoder. Without those, the 6.6% drop could come from extra capacity, different optimization dynamics, or simple multi-task averaging rather than the intended separation regularization. The claim that maximizing interference accuracy helps target separation therefore rests on an untested assumption.

This is a narrow but real empirical result in speech processing. A reader working on target-speaker or multi-speaker ASR would find the number useful and might try the auxiliary branch. It is not a broad methodological advance. I would send it to peer review because the task is relevant, the baseline is solid, and the reported gain is falsifiable; the reviewers can ask for the missing controls and ablations.
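One way the aligned-versus-opposing question could be probed is the cosine similarity between the two losses' gradients with respect to the shared parameters. The sketch below assumes both gradients are already available as flat lists of floats; `gradient_cosine` is a hypothetical helper and the vectors are toy values, not measurements from the paper.

```python
import math


def gradient_cosine(grad_main, grad_aux):
    """Cosine similarity between two flattened gradient vectors.

    Positive values suggest the objectives pull the shared layers the same
    way; negative values suggest they conflict. Returns 0.0 if either
    gradient is all zeros.
    """
    dot = sum(a * b for a, b in zip(grad_main, grad_aux))
    norm_main = math.sqrt(sum(a * a for a in grad_main))
    norm_aux = math.sqrt(sum(b * b for b in grad_aux))
    if norm_main == 0.0 or norm_aux == 0.0:
        return 0.0
    return dot / (norm_main * norm_aux)


# Toy vectors only, chosen to show the three regimes.
print(gradient_cosine([3.0, 4.0], [6.0, 8.0]))   # same direction
print(gradient_cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(gradient_cosine([1.0, 0.0], [-1.0, 0.0]))  # opposing
```

Tracked over training steps, a persistently negative value would support the conflicting-gradients reading; a value near zero would say little either way.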

Referee Report

2 major / 1 minor

Summary. The paper proposes an auxiliary interference-speaker loss for target-speaker ASR on monaural mixtures. Given a short target-speaker enrollment, the method trains a network (based on lattice-free MMI) to transcribe the target while an auxiliary branch maximizes ASR accuracy on the interference speaker(s). The authors claim this auxiliary objective regularizes shared layers toward better speaker-separation representations, yielding a 6.6% relative WER reduction (18.06% → 16.87%) on two-speaker test mixtures; they also note that the auxiliary branch can serve as a secondary ASR for interference speech.

Significance. If the reported WER reduction is reproducible and the regularization mechanism is confirmed, the approach would offer a lightweight multi-task training recipe that improves target-speaker ASR without architectural changes. The use of a strong LF-MMI baseline and the secondary utility of the auxiliary branch are practical strengths; however, the absence of loss-weighting details, gradient diagnostics, or controls for capacity versus regularization effects limits the result's immediate impact on the field.

major comments (2)

[Abstract / auxiliary loss description] Abstract and methods description of the auxiliary loss: the central claim that 'additionally maximizing interference speaker ASR accuracy during training ... will regularize the network to achieve a better representation for speaker separation' is presented without any reported loss weighting schedule, gradient-alignment measurements between the primary LF-MMI and auxiliary objectives, or ablation that isolates regularization from incidental multi-task or capacity effects. The 6.6% relative WER drop is therefore compatible with multiple alternative explanations.
[Evaluation / results] Evaluation section: no auxiliary-task WER trajectory, correlation between auxiliary and target WER, or statistical significance tests on the 18.06% → 16.87% difference are provided, making it impossible to assess whether the observed improvement is robust or merely within run-to-run variance.
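The significance check requested in the second comment could take the form of a paired bootstrap over test utterances. The sketch below is one such test under stated assumptions: per-utterance word-error counts for both systems and per-utterance reference lengths are available, and the function name, resample count, and toy numbers are all illustrative rather than taken from the paper.

```python
import random


def paired_bootstrap(base_errs, prop_errs, ref_lens, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which the proposed system's WER
    is NOT lower than the baseline's (a small value favors the proposal).

    Utterances are resampled with replacement, and the same resampled
    indices are used for both systems so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(ref_lens)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(ref_lens[i] for i in idx)
        wer_base = sum(base_errs[i] for i in idx) / total_words
        wer_prop = sum(prop_errs[i] for i in idx) / total_words
        if wer_prop >= wer_base:
            not_better += 1
    return not_better / n_resamples


# Toy per-utterance error counts (hypothetical, for illustration only).
base = [3, 2, 4, 1, 5, 2]
prop = [2, 2, 3, 1, 4, 3]
lens = [20, 15, 25, 10, 30, 18]
print(paired_bootstrap(base, prop, lens))
```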

minor comments (1)

[Abstract] The abstract states the baseline WER as 18.06% and the proposed result as 16.87% but does not specify the exact test-set size or number of runs, which would aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the auxiliary loss mechanism and evaluation details. We address each major comment below and will revise the manuscript to improve clarity and transparency.


Referee: [Abstract / auxiliary loss description] Abstract and methods description of the auxiliary loss: the central claim that 'additionally maximizing interference speaker ASR accuracy during training ... will regularize the network to achieve a better representation for speaker separation' is presented without any reported loss weighting schedule, gradient-alignment measurements between the primary LF-MMI and auxiliary objectives, or ablation that isolates regularization from incidental multi-task or capacity effects. The 6.6% relative WER drop is therefore compatible with multiple alternative explanations.

Authors: We acknowledge that the original manuscript does not report the loss weighting schedule, gradient diagnostics, or ablations isolating regularization from capacity effects. The auxiliary loss was combined with the primary LF-MMI loss using a fixed weighting factor during training; we will specify this value and the exact training configuration in the revised methods section. We will also add a brief discussion noting alternative explanations for the WER improvement while retaining the claim that the auxiliary objective encourages better separation, supported by the demonstrated utility of the auxiliary branch for interference ASR. This will be a partial revision focused on clarification rather than new experiments.

revision: partial

Referee: [Evaluation / results] Evaluation section: no auxiliary-task WER trajectory, correlation between auxiliary and target WER, or statistical significance tests on the 18.06% → 16.87% difference are provided, making it impossible to assess whether the observed improvement is robust or merely within run-to-run variance.

Authors: We agree these analyses are absent. We will add the auxiliary-task WER trajectory during training and the correlation with target-speaker WER if the per-epoch logs are recoverable. For the WER difference, we will include a statement that the 6.6% relative reduction was observed on a single test-set evaluation and that run-to-run variance was not quantified. We will also note consistency of gains across SIR conditions as supporting context. This constitutes a partial revision by adding available details and acknowledging limitations.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical multi-task training result with no self-referential derivation


The paper proposes an auxiliary loss for multi-task training and reports an empirical WER reduction (18.06% to 16.87%) on target-speaker ASR. No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction; the regularization mechanism is stated as an intended effect of the loss rather than derived from equations that loop back to fitted parameters or self-citations. The result is an observed outcome of training, not a tautological renaming or fitted-input prediction. Self-citations, if present, are not load-bearing for any central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that joint optimization of target and interference ASR objectives yields improved speaker separation representations. No free parameters or invented entities are mentioned in the abstract. Standard neural network training assumptions apply.

axioms (1)

domain assumption: Jointly optimizing target-speaker and interference-speaker ASR objectives produces better internal representations for speaker separation than target-only training.
This premise is invoked to justify why the auxiliary loss improves the main task.

pith-pipeline@v0.9.0 · 5768 in / 1283 out tokens · 47098 ms · 2026-05-25T16:09:55.929410+00:00 · methodology


Original abstract

In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR). Our method automatically extracts and transcribes target speaker's utterances from a monaural mixture of multiple speakers speech given a short sample of the target speaker. The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during training. This will regularize the network to achieve a better representation for speaker separation, thus achieving better accuracy on the target-speaker ASR. We evaluated our proposed method using two-speaker-mixed speech in various signal-to-interference-ratio conditions. We first built a strong target-speaker ASR baseline based on the state-of-the-art lattice-free maximum mutual information. This baseline achieved a word error rate (WER) of 18.06% on the test set while a normal ASR trained with clean data produced a completely corrupted result (WER of 84.71%). Then, our proposed loss further reduced the WER by 6.6% relative to this strong baseline, achieving a WER of 16.87%. In addition to the accuracy improvement, we also showed that the auxiliary output branch for the proposed loss can even be used for a secondary ASR for interference speakers' speech.
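The "various signal-to-interference-ratio conditions" can be made concrete with a small sketch of mixing two signals at a chosen SIR. It assumes SIR is the ratio of average target power to average scaled-interference power, each measured over the signal's own length, and that the shorter signal is zero-padded. `mix_at_sir` is a hypothetical helper, not the authors' data-generation script.

```python
import math


def mix_at_sir(target, interference, sir_db):
    """Scale the interference so the target-to-interference power ratio
    equals sir_db (in dB), then add the two signals sample by sample.

    Returns (mixture, gain), where gain is the factor applied to the
    interference.
    """
    if not target or not interference:
        raise ValueError("both signals must be non-empty")
    power_t = sum(x * x for x in target) / len(target)
    power_i = sum(x * x for x in interference) / len(interference)
    if power_i == 0.0:
        raise ValueError("interference signal has zero power")
    # SIR_dB = 10 * log10(power_t / (gain**2 * power_i)), solved for gain
    gain = math.sqrt(power_t / (power_i * 10.0 ** (sir_db / 10.0)))
    n = max(len(target), len(interference))
    padded_t = list(target) + [0.0] * (n - len(target))
    padded_i = list(interference) + [0.0] * (n - len(interference))
    mixture = [t + gain * i for t, i in zip(padded_t, padded_i)]
    return mixture, gain


# Toy signals: at 0 dB the interference is scaled to the target's power.
mixture, gain = mix_at_sir([1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0], 0.0)
print(gain, mixture)
```

A positive `sir_db` makes the target louder than the interference, and a negative value makes the interference dominate.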

Figures

Figures reproduced from arXiv: 1906.10876 by Kenji Nagamatsu, Naoyuki Kanda, Ryoichi Takashima, Shinji Watanabe, Shota Horiguchi, Yusuke Fujita.

**Figure 1.** Overview of target-speaker AM architecture with auxiliary interference speaker loss. Auxiliary networks for interference speaker loss are only used in training and normally removed in the decoding phase. A number with an arrow indicates a time splicing index, which forms the basis of a time-delay neural network (TDNN) [26]. The input features were advanced by five frames, which has the same effect as r…

**Figure 2.** Model architectures with early, middle, and late splitting. Note that the middle splitting model was used as the default model in other experiments.

**Figure 3.** Comparison among model architectures in …


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 1 internal anchor

[1]

wsj0-2mix

Introduction. Thanks to the recent advances in deep-learning [1–3], the accuracy of automatic speech recognition (ASR) for some datasets have become close to (e.g., Switchboard [4–6]) or beyond (e.g., LibriSpeech in [7] and [8]) the level of human transcribers. However, despite this progress, the accuracy of multi-talker speech recognition is still v...

[2]

Auxiliary Interference Speaker Loss. In this section, we explain our proposed method and its use with an LF-MMI-based AM due to its state-of-the-art performance [23, 27, 28]. However, it should be noted that our work can be extended to any kind of ASR loss like cross entropy (CE) [1], state-level minimum Bayes risk (sMBR) [29, 30], and lattice-free sMBR...

[3]

dev-clean

Evaluation. 3.1. Experiment with LibriSpeech. 3.1.1. Experimental settings. For our primary evaluation, we used the LibriSpeech corpus [36], which consists of about 1,000 hours of read-aloud English speech. In this study, we used 100 hours of the clean portion of the dataset for training the AMs. For evaluation, we used “dev-clean” and “test-clean” in acco...

[4]

Prepare a list of speech samples (main list), which is the main target of ASR

[5]

Shuffle the main list to create a second list under the constraint that the same speaker does not appear in the same line in the main and second lists

[6]

Mix the audio in the main list and the second list one-by-one, with a specific SIR. For training data, we randomly sampled an SIR value from uniform distribution between -10 dB and 10 dB for each mixture. For the development and evaluation data, we generated data with an SIR of {10, 5, 0, -5, -10} dB

[7]

Only in the case of the training data, the volume of each mixed speech was randomly changed to enhance robustness for the volume difference. Note that, in accordance with the protocol above, the speech of the target speaker could be much shorter or much longer than that of the interference speaker. We intentionally selected this protocol because we bel...

[8]

clean AM

with scales of 0.00005 and 0.1, respectively. The leaky hidden Markov model coefficient was set to 0.1. In addition, a backstitch technique [40] with a backstitch scale of 1.0 and backstitch interval of 4 was used. For comparison purposes, we trained the AM without the proposed auxiliary loss, which corresponds to the original target-speaker model. We als...

[9]

Conclusions. In this paper, we proposed a novel auxiliary loss function for target-speaker ASR, in which it attempts to maximize interference speaker ASR accuracy. We evaluated our proposed method using two-speaker-mixed speech in various SIR conditions. We first built a strong target-speaker ASR baseline based on the state-of-the-art LF-MMI, achieving ...

[10] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. INTERSPEECH, 2011, pp. 437–440.
[11] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. on ASLP, vol. 20, no. 1, pp. 30–42, 2012.
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[13] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.
[14] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim et al., “English conversational telephone speech recognition by humans and machines,” Proc. INTERSPEECH, pp. 132–136, 2017.
[15] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Trans. on ASLP, vol. 25, no. 12, pp. 2410–2423, 2017.
[16] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in Proc. ICML, 2016, pp. 173–182.
[17] N. Kanda, Y. Fujita, and K. Nagamatsu, “Lattice-free state-level minimum Bayes risk training of acoustic models,” in Proc. INTERSPEECH, 2018, pp. 2923–2927.
[18] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva, “Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks,” in Proc. INTERSPEECH, 2018, pp. 3038–3042.
[19] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth CHiME speech separation and recognition challenge: Dataset, task and baselines,” in Proc. INTERSPEECH, 2018, pp. 1561–1565.
[20] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Y. Soplin, M. Maciejewski, S.-J. Chen et al., “The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays,” in Proc. CHiME-5, 2018, pp. 6–10.
[21] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, and S. Watanabe, “Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches,” in Proc. ICASSP, 2019.
[22] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35.
[23] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Proc. ICASSP, 2017, pp. 246–250.
[24] D. Yu, X. Chang, and Y. Qian, “Recognizing multi-talker speech with permutation invariant training,” in Proc. INTERSPEECH, 2017, pp. 2456–2460.
[25] Z. Chen, J. Droppo, J. Li, and W. Xiong, “Progressive joint modeling in unsupervised single-channel overlapped speech recognition,” IEEE/ACM Trans. on ASLP, vol. 26, no. 1, pp. 184–196, 2018.
[26] S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey, “End-to-end multi-speaker speech recognition,” in Proc. ICASSP, 2018, pp. 4819–4823.
[27] H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey, “A purely end-to-end system for multi-speaker speech recognition,” in Proc. ACL, 2018, pp. 2620–2630.
[28] X. Chang, Y. Qian, K. Yu, and S. Watanabe, “End-to-end monaural multi-speaker ASR system without pretraining,” in Proc. ICASSP, 2019.
[29] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017, pp. 241–245.
[30] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Proc. INTERSPEECH, 2017.
[31] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. ICASSP, 2018, pp. 5554–5558.
[32] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. INTERSPEECH, 2016, pp. 2751–2755.
[33] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in Proc. INTERSPEECH, 2016, pp. 545–549.
[34] Y. Qian, X. Chang, and D. Yu, “Single-channel multi-talker speech recognition with permutation invariant training,” Speech Communication, vol. 104, pp. 1–11, 2018.
[35] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. INTERSPEECH, 2015, pp. 3214–3218.
[36] N. Kanda, Y. Fujita, and K. Nagamatsu, “Investigation of lattice-free maximum mutual information-based acoustic models with sequence-level Kullback-Leibler divergence,” in Proc. ASRU, 2017, pp. 69–76.
[37] N. Kanda, Y. Fujita, and K. Nagamatsu, “Sequence distillation for purely sequence trained acoustic models,” in Proc. ICASSP, 2018, pp. 5964–5968.
[38] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proc. INTERSPEECH, 2013, pp. 2345–2349.
[39] H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” in Proc. ICASSP, 2013, pp. 6664–6668.
[40] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
[41] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” Proc. NIPS workshop on Deep Learning and Representation Learning, 2014.
[42] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
[43] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. on ASLP, vol. 19, no. 4, pp. 788–798, 2011.
[44] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, 2013, pp. 55–59.
[45] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015.
[46] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
[47] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Trans. ASSP, vol. 37, no. 3, pp. 328–339, 1989.
[48] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[49] Y. Wang, V. Peddinti, H. Xu, X. Zhang, D. Povey, and S. Khudanpur, “Backstitch: Counteracting finite-sample bias via negative steps,” in Proc. INTERSPEECH, 2017, pp. 1631–1635.

[1] [1]

wsj0- 2mix

Introduction Thanks to the recent advances in deep-learning [1–3], the ac cu- racy of automatic speech recognition (ASR) for some dataset s have become close to (e.g., Switchboard [4–6]) or beyond (e. g., LibriSpeech in [7] and [8]) the level of human transcribers. However, despite this progress, the accuracy of multi-talk er speech recognition is still v...

work page

[2] [2]

Auxiliary Interference Speaker Loss In this section, we explain our proposed method and its use with an LF-MMI-based AM due to its state-of-the-art perfor- mance [23, 27, 28]. However, it should be noted that our work can be extended to any kind of ASR loss like cross entropy (CE) [1], state-level minimum Bayes risk (sMBR) [29, 30], an d lattice-free sMBR...

work page

[3] [3]

dev-clean

Evaluation 3.1. Experiment with LibriSpeech 3.1.1. Experimental settings For our primary evaluation, we used the LibriSpeech corpus [36], which consists of about 1,000 hours of read-aloud En- glish speech. In this study, we used 100 hours of the clean portion of the dataset for training the AMs. For evaluation, we used “dev-clean” and “test-clean” in acco...

work page

[4] [4]

Prepare a list of speech samples (main list), which is the main target of ASR

work page

[5] [5]

Shufﬂe the main list to create a second list under the con- straint that the same speaker does not appear in the same line in the main and second lists

work page

[6] [6]

For training data, we randomly sampled an SIR value from uniform distribution between -10 dB and 10 dB for each mixture

Mix the audio in the main list and the second list one-by- one, with a speciﬁc SIR. For training data, we randomly sampled an SIR value from uniform distribution between -10 dB and 10 dB for each mixture. For the development and evaluation data, we generated data with an SIR of {10, 5, 0, -5, -10 } dB

work page

[7] [7]

Note that, in accordance with the protocol above, the speech of the target speaker could be much shorter or much longer than that of the interference speaker

Only in the case of the training data, the volume of each mixed speech was randomly changed to enhance robust- ness for the volume difference. Note that, in accordance with the protocol above, the speech of the target speaker could be much shorter or much longer than that of the interference speaker. We intentionally sel ected this protocol because we bel...

work page

[8] [8]

clean AM

with scales of 0.00005 and 0.1, respectively. The leaky hidden Markov model coefﬁcient was set to 0.1. In addition, a backstitch technique [40] with a backstitch scale of 1.0 an d backstitch interval of 4 was used. For comparison purposes, we trained the AM without the proposed auxiliary loss, which corresponds to the original target-speaker model. We als...


Conclusions In this paper, we proposed a novel auxiliary loss function for target-speaker ASR that attempts to maximize interference-speaker ASR accuracy. We evaluated our proposed method using two-speaker-mixed speech in various SIR conditions. We first built a strong target-speaker ASR baseline based on the state-of-the-art LF-MMI, achieving ...


[10] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. INTERSPEECH, 2011, pp. 437–440

[11] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. on ASLP, vol. 20, no. 1, pp. 30–42, 2012

[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012

[13] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016

[14] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim et al., “English conversational telephone speech recognition by humans and machines,” in Proc. INTERSPEECH, 2017, pp. 132–136

[15] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward human parity in conversational speech recognition,” IEEE/ACM Trans. on ASLP, vol. 25, no. 12, pp. 2410–2423, 2017

[16] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in Proc. ICML, 2016, pp. 173–182

[17] N. Kanda, Y. Fujita, and K. Nagamatsu, “Lattice-free state-level minimum Bayes risk training of acoustic models,” in Proc. INTERSPEECH, 2018, pp. 2923–2927

[18] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva, “Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks,” in Proc. INTERSPEECH, 2018, pp. 3038–3042

[19] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth CHiME speech separation and recognition challenge: Dataset, task and baselines,” in Proc. INTERSPEECH, 2018, pp. 1561–1565

[20] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Y. Soplin, M. Maciejewski, S.-J. Chen et al., “The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays,” in Proc. CHiME-5, 2018, pp. 6–10

[21] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, and S. Watanabe, “Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches,” in Proc. ICASSP, 2019

[22] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35

[23] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Proc. ICASSP, 2017, pp. 246–250

[24] D. Yu, X. Chang, and Y. Qian, “Recognizing multi-talker speech with permutation invariant training,” in Proc. INTERSPEECH, 2017, pp. 2456–2460

[25] Z. Chen, J. Droppo, J. Li, and W. Xiong, “Progressive joint modeling in unsupervised single-channel overlapped speech recognition,” IEEE/ACM Trans. on ASLP, vol. 26, no. 1, pp. 184–196, 2018

[26] S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey, “End-to-end multi-speaker speech recognition,” in Proc. ICASSP, 2018, pp. 4819–4823

[27] H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey, “A purely end-to-end system for multi-speaker speech recognition,” in Proc. ACL, 2018, pp. 2620–2630

[28] X. Chang, Y. Qian, K. Yu, and S. Watanabe, “End-to-end monaural multi-speaker ASR system without pretraining,” in Proc. ICASSP, 2019

[29] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017, pp. 241–245

[30] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Proc. INTERSPEECH, 2017

[31] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. ICASSP, 2018, pp. 5554–5558

[32] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. INTERSPEECH, 2016, pp. 2751–2755

[33] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in Proc. INTERSPEECH, 2016, pp. 545–549

[34] Y. Qian, X. Chang, and D. Yu, “Single-channel multi-talker speech recognition with permutation invariant training,” Speech Communication, vol. 104, pp. 1–11, 2018

[35] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. INTERSPEECH, 2015, pp. 3214–3218

[36] N. Kanda, Y. Fujita, and K. Nagamatsu, “Investigation of lattice-free maximum mutual information-based acoustic models with sequence-level Kullback-Leibler divergence,” in Proc. ASRU, 2017, pp. 69–76

[37] N. Kanda, Y. Fujita, and K. Nagamatsu, “Sequence distillation for purely sequence trained acoustic models,” in Proc. ICASSP, 2018, pp. 5964–5968

[38] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proc. INTERSPEECH, 2013, pp. 2345–2349

[39] H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” in Proc. ICASSP, 2013, pp. 6664–6668

[40] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376

[41] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” in Proc. NIPS Workshop on Deep Learning and Representation Learning, 2014

[42] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964

[43] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. on ASLP, vol. 19, no. 4, pp. 788–798, 2011

[44] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, 2013, pp. 55–59

[45] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015

[46] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011

[47] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Trans. ASSP, vol. 37, no. 3, pp. 328–339, 1989

[48] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

[49] Y. Wang, V. Peddinti, H. Xu, X. Zhang, D. Povey, and S. Khudanpur, “Backstitch: Counteracting finite-sample bias via negative steps,” in Proc. INTERSPEECH, 2017, pp. 1631–1635