Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

Bo Xu; Linhao Dong; Shiyu Zhou; Shuang Xu

arxiv: 1804.10752 · v2 · pith:RZAJAYWYnew · submitted 2018-04-28 · 📡 eess.AS · cs.CL · cs.SD


classification 📡 eess.AS · cs.CL · cs.SD

keywords model, transformer, sequence-to-sequence, attention-based, chinese, ci-phoneme, mandarin, sequences


Abstract

Sequence-to-sequence attention-based models, which integrate the acoustic, pronunciation, and language models into a single neural network, have recently shown very promising results on automatic speech recognition (ASR) tasks. Among these models, the Transformer, a new sequence-to-sequence attention-based model that relies entirely on self-attention without using RNNs or convolutions, achieves a new single-model state-of-the-art BLEU on neural machine translation (NMT) tasks. Given the outstanding performance of the Transformer, we extend it to speech and adopt it as the basic architecture of a sequence-to-sequence attention-based model for Mandarin Chinese ASR tasks. Furthermore, we compare a syllable-based model with a context-independent phoneme (CI-phoneme) based model, both using the Transformer, in Mandarin Chinese. Additionally, we propose a greedy cascading decoder with the Transformer for mapping CI-phoneme sequences and syllable sequences into word sequences. Experiments on HKUST datasets demonstrate that the syllable-based model with the Transformer performs better than its CI-phoneme based counterpart and achieves a character error rate (CER) of 28.77%, which is competitive with the state-of-the-art CER of 28.0% achieved by the joint CTC-attention based encoder-decoder network.
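The abstract reports results as character error rate (CER). As a rough illustration of that metric, here is a minimal sketch assuming the common definition: total character-level edit distance divided by the total number of reference characters. The function names `edit_distance` and `cer` are hypothetical, and this is not the scoring script used in the paper, which may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: summed edits over summed reference characters."""
    total_edits = sum(edit_distance(r, h)
                      for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # One substituted character out of six reference characters.
    print(f"{cer(['今天天气很好'], ['今天天汽很好']):.4f}")
```

Under this definition, the two figures quoted in the abstract (28.77% and 28.0%) differ by 0.77 percentage points absolute.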


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NIESR: Nuisance Invariant End-to-end Speech Recognition
cs.CL 2019-07 unverdicted novelty 6.0

NIESR applies unsupervised adversarial invariance induction to end-to-end ASR, reporting 5.48-14.44% relative error reductions on WSJ0, CHiME3, and TIMIT without nuisance factor labels.

Root Mean Square Layer Normalization
cs.LG 2019-10 conditional novelty 5.0

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition
eess.AS 2019-07 unverdicted novelty 4.0

Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.