Deep Speech: Scaling up end-to-end speech recognition
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
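Results on Switchboard Hub5'00, such as the 16.0% figure above, are conventionally reported as word error rate (WER): the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that calculation, not the scoring pipeline used in the paper. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization, which official scoring tools typically apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.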
Forward citations
Cited by 11 Pith papers
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- Gauge-covariant stochastic neural fields: Stability and finite-width effects
  A gauge-covariant stochastic neural field theory is introduced that derives the maximal Lyapunov exponent and amplification factor, showing finite-width effects as perturbative corrections to dressed kernels that leav...
- Deep Learning Scaling is Predictable, Empirically
  Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
- Mixed Precision Training
  Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
- SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
  SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
- Sink or SWIM: Tackling Real-Time ASR at Scale
  SWIM scales Whisper ASR to 20 concurrent multilingual clients via buffer merging, achieving ~2.4s delay at 5 clients versus 3.4s for single-client baselines while preserving accuracy.
- Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
  DPOP is a new loss function that prevents DPO from lowering preferred-response likelihoods, outperforms standard DPO on diverse datasets and MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
- Fine-grained robust prosody transfer for single-speaker neural text-to-speech
  Decouples prosody alignment via pre-computed phoneme timestamps and adds a VAE to achieve robust fine-grained prosody transfer in single-speaker neural TTS from unseen speakers.
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?
  Direct prompting scales more consistently than CoT prompting for speech-to-text translation as the amount of S2TT data increases.
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
  The authors propose and test a data augmentation framework based on deepfake audio to improve training of speech-to-text transcription models.