Listen, Attend and Spell

William Chan , Navdeep Jaitly , Quoc V. Le , Oriol Vinyals

Authors on Pith no claims yet

classification 💻 cs.CL cs.LGcs.NEstat.ML

keywords modelnetworkcharactersachievesattendcomponentslanguagelearns

read the original abstract

We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...