End-to-End ASR for Code-switched Hindi-English Speech

Basil Abraham; Brij Mohan Lal Srivastava; Preethi Jyothi; Rupesh Mehta; Sunayana Sitaram

arXiv: 1906.09426 · v1 · pith:BWJERMYEnew · submitted 2019-06-22 · 📡 eess.AS · cs.CL · cs.LG · cs.SD

Pith reviewed 2026-05-25 18:00 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.LG · cs.SD

keywords end-to-end ASR · code-switched speech · Hindi-English · multi-task learning · low-resource ASR · corpus balancing · grapheme imbalance

The pith

End-to-end ASR for Hindi-English code-switched speech improves with multi-task learning and corpus balancing under 50 hours of data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether end-to-end models can handle code-switched Hindi-English speech when training data is scarce. It tests two adjustments: multi-task learning to provide extra training signals and explicit balancing of the corpus to correct skewed grapheme frequencies. These steps are compared against conventional cascaded ASR pipelines. The central claim is that data scarcity hurts end-to-end performance but the two adjustments produce measurable gains. Readers interested in low-resource multilingual speech systems would therefore see a concrete route for making end-to-end training viable in similar settings.

Core claim

While insufficient data harms end-to-end ASR accuracy on code-switched Hindi-English speech, multi-task learning and corpus balancing deliver promising improvements that narrow the gap with traditional cascaded systems.

What carries the argument

Multi-task learning plus corpus balancing to counteract grapheme class imbalance during end-to-end training on limited code-switched data.
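The multi-task objective can be written compactly. The paper's excerpted text describes a convex combination of the CTC and attention losses controlled by a parameter λ. Which term carries λ is an assumption here, following the usual joint CTC-attention convention:

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1
```

Under this convention, λ = 0 trains with the attention loss only and λ = 1 with the CTC loss only; intermediate values mix the two signals.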

If this is right

End-to-end models become competitive with cascaded pipelines for code-switched recognition once data imbalance is addressed.
Corpus balancing directly reduces the effect of rare graphemes on overall word error rate.
Multi-task learning supplies auxiliary objectives that stabilize training when acoustic data is limited.
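One way to picture corpus balancing is as a reweighting of utterances by how rare their graphemes are. The sketch below is a hypothetical scheme, not the paper's procedure (the exact balancing method and any oversampling ratios are not given on this page). The function names and the toy transcripts are assumptions made for illustration.

```python
from collections import Counter


def grapheme_counts(transcripts):
    """Count how often each grapheme occurs across all transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(text)
    return counts


def utterance_weights(transcripts, counts):
    """Hypothetical scheme: weight each utterance by the mean inverse
    frequency of its graphemes, then normalise the weights to sum to 1."""
    raw = []
    for text in transcripts:
        if not text:
            raw.append(0.0)  # empty transcripts are never sampled
            continue
        raw.append(sum(1.0 / counts[g] for g in text) / len(text))
    total = sum(raw)
    return [w / total for w in raw]


# Toy transcripts (assumed data); 'z' is deliberately rare.
transcripts = ["aaaa", "aaab", "abz"]
counts = grapheme_counts(transcripts)
weights = utterance_weights(transcripts, counts)
```

The resulting weights could be passed to `random.choices(transcripts, weights=weights, k=n)` to oversample utterances that contain rare graphemes. Whether this resembles the paper's actual balancing step is not something this page establishes.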

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same balancing and multi-task approach could be tested on other code-switched pairs that also lack large corpora.
If the gains hold, they suggest that explicit frequency correction may be more important than simply collecting more raw hours for this domain.
Detailed per-grapheme error breakdowns in the full paper would clarify whether the balancing step truly equalizes performance across frequent and rare units.
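A per-grapheme breakdown of the kind mentioned above could be approximated as follows. This is a minimal sketch that assumes reference and hypothesis transcripts are available as strings. It aligns them with the standard-library `difflib.SequenceMatcher`, which is a heuristic alignment rather than a minimum-edit-distance one, and the function name is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def per_grapheme_error_rate(pairs):
    """For each reference grapheme, return the fraction of its occurrences
    that were substituted or deleted in the hypothesis.
    `pairs` is an iterable of (reference, hypothesis) strings."""
    totals, errors = Counter(), Counter()
    for ref, hyp in pairs:
        totals.update(ref)
        matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            # 'replace' and 'delete' spans mark reference graphemes that
            # were not recovered; 'insert' has no reference graphemes.
            if tag in ("replace", "delete"):
                errors.update(ref[i1:i2])
    return {g: errors[g] / totals[g] for g in totals}


# Toy example (assumed data): only 'c' is misrecognised.
rates = per_grapheme_error_rate([("abc", "abd"), ("aab", "aab")])
```

Comparing these rates for frequent and rare graphemes, before and after balancing, is one way to check whether balancing narrows the gap.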

Load-bearing premise

The observed gains come from the multi-task learning and balancing steps rather than from unstated choices in data splits, model hyperparameters, or random variation.

What would settle it

Re-running the exact training setups on the same corpus with multiple random seeds and finding that the performance difference between the adjusted end-to-end models and the baseline disappears or reverses.
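The multi-seed comparison described above can be summarised with a few lines of standard-library Python. This is a rough sketch: the per-seed WER values are invented placeholders, not results from the paper, and the decision rule is a crude heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize(wers):
    """Return (mean, sample standard deviation) of per-seed WERs."""
    return mean(wers), stdev(wers)


def gain_exceeds_spread(baseline_wers, adjusted_wers):
    """Crude check: is the mean WER reduction larger than the combined
    seed-to-seed standard deviations of the two systems?"""
    base_mean, base_sd = summarize(baseline_wers)
    adj_mean, adj_sd = summarize(adjusted_wers)
    return (base_mean - adj_mean) > (base_sd + adj_sd)


# Placeholder WERs (percent) for three seeds; invented for illustration only.
baseline_runs = [40.0, 41.0, 42.0]
adjusted_runs = [37.0, 38.0, 39.0]
robust = gain_exceeds_spread(baseline_runs, adjusted_runs)
```

If the check fails for the real runs, the reported gain is within the seed-to-seed spread and the claim of improvement would need stronger evidence.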

Figures

Figures reproduced from arXiv: 1906.09426 by Basil Abraham, Brij Mohan Lal Srivastava, Preethi Jyothi, Rupesh Mehta, Sunayana Sitaram.

**Figure 2.** Training set character distribution for code-switched Hindi-English. The y-axis plots the frequency of the corresponding character on the x-axis. Some prominent bars are annotated with the characters they represent.

**Figure 3.** Character confusion matrices for Hindi-English CS data and different MTL λ values. The dotted green region indicates confusion for Hindi characters and the dotted red region shows confusion for English characters.

Original abstract

End-to-end (E2E) models have been explored for large speech corpora and have been found to match or outperform traditional pipeline-based systems in some languages. However, most prior work on end-to-end models use speech corpora exceeding hundreds or thousands of hours. In this study, we explore end-to-end models for code-switched Hindi-English language with less than 50 hours of data. We utilize two specific measures to improve network performance in the low-resource setting, namely multi-task learning (MTL) and balancing the corpus to deal with the inherent class imbalance problem i.e. the skewed frequency distribution over graphemes. We compare the results of the proposed approaches with traditional, cascaded ASR systems. While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Modest but consistent gains from MTL and balancing on small-scale Hindi-English code-switched ASR.

The central point is that this paper gets usable improvements on end-to-end models for code-switched Hindi-English ASR with very limited data by using multi-task learning and balancing the corpus for class imbalance. The results hold up in the reported tables. The work applies these techniques to a setting where prior E2E work has not focused much, given the data constraints. They provide comparisons against cascaded ASR systems and include enough training details to see how the models were set up. The full manuscript shows the WER numbers are consistent with the described methods, and there are no obvious flaws in how the data was handled or split.

What the paper does well is acknowledge the difficulty of low data for E2E systems and demonstrate practical steps that help without overclaiming. The balancing addresses the skewed grapheme distribution directly, which is a reasonable choice for this task.

The soft spots are mostly around evaluation strength. There are no error bars or multiple random seeds reported, so we cannot tell how stable the gains are across different initializations. The absolute improvements are modest, and with less than 50 hours the risk of overfitting or high variance is real. The scope stays narrow to this one language pair, so broader implications for other code-switched languages are not explored.

This paper is aimed at people building speech systems for bilingual populations where data collection is expensive. A reader interested in low-resource ASR techniques would get value from the specific numbers and setup. It deserves peer review. The experiments are reproducible from the details and the argument is straightforward, even if it needs more statistical support in revision.

Referee Report

1 major / 2 minor

Summary. The paper explores end-to-end (E2E) ASR for code-switched Hindi-English speech using under 50 hours of data. It applies multi-task learning (MTL) and corpus balancing to address data scarcity and grapheme imbalance, then compares performance against traditional cascaded pipeline systems, concluding that MTL and balancing yield promising improvements despite the low-resource constraint.

Significance. If the reported WER gains hold under the described conditions, the work provides a concrete data point on adapting E2E models to low-resource code-switched settings, an area of practical importance. The internal consistency of the training details, baseline comparisons, and WER tables noted in the full manuscript strengthens the empirical argument within its scope.
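For readers less familiar with the metric behind these gains, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch assuming whitespace-tokenised transcripts; the function names are illustrative and it is not the scoring tool used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists
    (substitutions, insertions and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate of a hypothesis against a non-empty reference."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words.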

major comments (1)

[Results] Results section (WER tables): the central claim of 'promising improvements' rests on reported deltas, yet no error bars, multiple random seeds, or statistical significance tests are mentioned; this leaves open whether the gains exceed run-to-run variance and weakens external generalization claims.

minor comments (2)

[Abstract] Abstract: quantitative WER numbers and baseline details are absent, forcing readers to reach the full text for the actual evidence.
[Experimental Setup] Experimental setup: clearer description of how the MTL auxiliary task is weighted and how the corpus balancing is implemented (e.g., exact oversampling ratios) would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the statistical robustness of our empirical results. We address the single major comment below.

Referee: [Results] Results section (WER tables): the central claim of 'promising improvements' rests on reported deltas, yet no error bars, multiple random seeds, or statistical significance tests are mentioned; this leaves open whether the gains exceed run-to-run variance and weakens external generalization claims.

Authors: We agree that the lack of error bars, multiple seeds, and significance testing is a limitation that weakens confidence in the reported deltas. All experiments in the manuscript were run once with a fixed random seed owing to the high computational cost of E2E training even on <50 h of data. In the revised manuscript we will rerun the primary configurations (baseline, MTL, and balanced) with at least three different seeds, report mean WER together with standard deviation in the tables, and apply a paired statistical test (e.g., McNemar or bootstrap) on the test-set hypotheses to establish whether the observed gains exceed run-to-run variance.

revision: yes
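A paired bootstrap of the kind mentioned in the response could look roughly like this. It is a sketch under stated assumptions: per-utterance word error counts and reference lengths are already available for both systems, and the function name, parameters and toy inputs are hypothetical.

```python
import random


def paired_bootstrap(errors_a, errors_b, ref_lengths, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system B has a lower WER
    than system A on the same resampled set of utterances.
    errors_*: per-utterance word error counts for each system.
    ref_lengths: per-utterance reference word counts (all positive)."""
    if not (len(errors_a) == len(errors_b) == len(ref_lengths)):
        raise ValueError("inputs must have one entry per utterance")
    rng = random.Random(seed)
    n = len(ref_lengths)
    wins = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        total_len = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / total_len
        wer_b = sum(errors_b[i] for i in idx) / total_len
        if wer_b < wer_a:
            wins += 1
    return wins / n_resamples


# Toy inputs (assumed): system B makes fewer errors on every utterance.
win_rate = paired_bootstrap([3, 2, 4, 1], [1, 0, 2, 0], [5, 5, 5, 5])
```

A win rate close to 1 suggests the improvement is consistent across resampled test sets. Seed-to-seed variance in training is a separate question and still needs the repeated runs.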

Circularity Check

0 steps flagged

No significant circularity

This is an empirical paper reporting ASR experiments on <50h code-switched Hindi-English data. It contains no equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. All claims rest on direct experimental comparisons (WER tables, MTL vs. baseline, corpus balancing) that are internally consistent and falsifiable against the reported data splits and metrics. No step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard neural ASR modeling assumptions and does not introduce new free parameters, axioms, or entities beyond routine application of existing techniques.

axioms (1)

domain assumption: Standard neural network optimization and data preprocessing assumptions hold for this low-resource ASR task.
Invoked implicitly when applying MTL and balancing without stated deviations.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 14 internal anchors

[1] Introduction. Unlike the traditional automatic speech recognition (ASR) pipeline where several independently optimized modules are integrated (typically using weighted finite state transducers), end-to-end (E2E) architectures instead optimize a single neural network that maps acoustic events to grapheme sequences. Recently, there has been a substanti...

[2] Relation to prior work. Graves et al. [8] first proposed the CTC loss function to predict the underlying sequence of phonemes in a speech signal without using any a priori phone alignment information. They achieved around 30% label error rate on the TIMIT corpus. Graves et al. [5] proposed an end-to-end model that yielded a 27% WER on the Wall Street Journa...

[3] Our approach. In this section, we describe our two main approaches for the low-resource code-switched setting. 3.1. Multi-task learning. We make use of the multi-task learning framework outlined by [7]. In this approach, we jointly optimize a convex combination of two loss functions: CTC and attention. CTC loss is defined as the negative log likelihood of ...

[4] Experiments. 4.1. Data. Table 1 describes the conversational Hindi-English data used in our study; more details can be found in [23]. The grapheme set includes ⟨space⟩, ⟨sos⟩, ⟨eos⟩ and ⟨unk⟩ identifiers for spaces, start-of-sentence, end-of-sentence and unknown characters. 4.2. Baselines. We compare our approaches against baselines built using the traditional AS...

[5] Conclusion & Future directions. In this work, we investigate two approaches for end-to-end ASR of code-switched speech in a low-resource setting. We explore multi-task learning by combining CTC and attention loss and notice that there is a certain range of the combination parameter λ which produces robust performance. We also notice that most of the errors...

[6] C. Weng, J. Cui, G. Wang, J. Wang, C. Yu, D. Su, and D. Yu, “Improving attention based sequence-to-sequence models for end-to-end English conversational speech recognition,” Proc. Interspeech, Hyderabad, India, pp. 761–765, 2018.

[7] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, “End-to-end speech recognition from the raw waveform,” arXiv preprint arXiv:1806.07098, 2018.

[8] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.

[9] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, “Multilingual speech recognition with a single end-to-end model,” arXiv preprint arXiv:1711.01694, 2017.

[10] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.

[11] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.

[12] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4835–4839.

[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[14] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167–174.

[15] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” CoRR, vol. abs/1609.03193, 2016. [Online]. Available: http://arxiv.org/abs/1609.03193

[16] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter-based speech recognition with gated convnets,” CoRR, vol. abs/1712.09444, 2017. [Online]. Available: http://arxiv.org/abs/1712.09444

[17] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4845–4849.

[18] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. L. Y. Bengio, and A. Courville, “Towards end-to-end speech recognition with deep convolutional neural networks,” arXiv preprint arXiv:1701.02720, 2017.

[19] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent nn: First results,” arXiv preprint arXiv:1412.1602, 2014.

[20] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.

[21] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945–4949.

[22] B. Li, T. N. Sainath, K. C. Sim, M. Bacchiani, E. Weinstein, P. Nguyen, Z. Chen, Y. Wu, and K. Rao, “Multi-dialect speech recognition with a single sequence-to-sequence model,” arXiv preprint arXiv:1712.01541, 2017.

[23] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.

[24] G. I. Winata, A. Madotto, C.-S. Wu, and P. Fung, “Towards end-to-end automatic code-switching speech recognition,” arXiv preprint arXiv:1810.12620, 2018.

[25] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, “Towards end-to-end code-switching speech recognition,” arXiv preprint arXiv:1810.13091, 2018.

[26] Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.

[27] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” arXiv preprint arXiv:1710.05381, 2017.

[28] S. Sivasankaran, B. M. L. Srivastava, S. Sitaram, K. Bali, and M. Choudhury, “Phone merging for code-switched speech recognition,” in Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, 2018, pp. 11–19.

[29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.

[30] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.

[31] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[32] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
