Listen, Attend and Spell
We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
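The two mechanisms the abstract names, a pyramidal encoder that shortens the input sequence and an attention step that lets the decoder look back over the encoder output, can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that the abstract does not state: the helper names `pyramid_reduce` and `attend` are hypothetical, the recurrent layers that would sit between reductions are omitted, an odd trailing frame is simply dropped, and plain dot-product scoring stands in for whatever scoring function the model actually uses.

```python
import math

def pyramid_reduce(frames):
    """Halve the time resolution by concatenating each pair of adjacent frames."""
    if len(frames) % 2:
        frames = frames[:-1]  # assumption of this sketch: drop an odd trailing frame
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]

def attend(query, keys, values):
    """Return a softmax-weighted average of `values`, scored by dot product with `query`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

# Toy input: 16 time steps with one feature each (stand-in for filter bank frames).
frames = [[float(t)] for t in range(16)]

# Three stacked reductions shorten the sequence by a factor of 2**3 = 8.
h = frames
for _ in range(3):
    h = pyramid_reduce(h)

# One decoder step: a hypothetical all-zero decoder state attends over the encoder output.
query = [0.0] * len(h[0])
context, weights = attend(query, h, h)
```

With the all-zero query every encoder position receives the same score, so the weights are uniform; in a trained model the decoder state would change at each emitted character, and because each step is conditioned on the previously emitted characters, no independence assumption between output characters is needed.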
Forward citations
Cited by 9 Pith papers
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
- In-context Learning and Induction Heads
  Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
- Improving Speech Recognition of Named Entities in Classroom Speech with LLM Revision and Phonetic-Semantic Context
  An LLM-based revision method with phonetic-semantic context reduces named entity word error rate by up to 30% relative on a new 45-hour MIT classroom speech dataset.
- Cross-Attention End-to-End ASR for Two-Party Conversations
  End-to-end ASR model with speaker-specific cross-attention for two-party conversations outperforms standard models on the Switchboard corpus.
- NIESR: Nuisance Invariant End-to-end Speech Recognition
  NIESR applies unsupervised adversarial invariance induction to end-to-end ASR, reporting 5.48-14.44% relative error reductions on WSJ0, CHiME3, and TIMIT without nuisance factor labels.
- Self Multi-Head Attention for Speaker Recognition
  Self multi-head attention applied after CNN encoding of spectrograms outperforms temporal and statistical pooling for speaker verification on VoxCeleb1 with 18% relative EER reduction.
- StepAudio 2.5 Technical Report
  StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
- MedASR: An Open-Source Model for High-Accuracy Medical Dictation
  MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.
- Hierarchical Sequence to Sequence Voice Conversion with Limited Data
  Hierarchical seq2seq model for parallel voice conversion pretrained as autoencoder on single-speaker data then adapted to limited multispeaker data, using mel spectrograms converted via wavenet vocoder.