pith. machine review for the scientific record.

arxiv: 1412.5567 · v2 · submitted 2014-12-17 · 💻 cs.CL · cs.LG · cs.NE

Recognition: unknown

Deep Speech: Scaling up end-to-end speech recognition

classification 💻 cs.CL cs.LG cs.NE
keywords speech, deep, system, systems, data, end-to-end, environments, need

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

This paper has not been read by Pith yet.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  2. Deep Learning Scaling is Predictable, Empirically

    cs.LG 2017-12 unverdicted novelty 7.0

    Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.

  3. Mixed Precision Training

    cs.AI 2017-10 accept novelty 7.0

    Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

  4. SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

  5. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.