Deep Speech: Scaling up end-to-end speech recognition

2014 · cs.CL · arXiv 1412.5567

12 Pith papers cite this work. Polarity classification is still being indexed.

abstract

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

citation-role summary

background 2 · dataset 1

citation-polarity summary

background 2 · use dataset 1

representative citing papers

Gauge-covariant stochastic neural fields: Stability and finite-width effects

hep-th · 2025-08-26 · unverdicted · novelty 7.0

A gauge-covariant stochastic neural field theory is introduced that derives the maximal Lyapunov exponent and amplification factor, showing finite-width effects as perturbative corrections to dressed kernels that leave the marginality condition unchanged for fixed kernel geometry.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

Deep Learning Scaling is Predictable, Empirically

cs.LG · 2017-12-01 · unverdicted · novelty 7.0

Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.

Mixed Precision Training

cs.AI · 2017-10-10 · accept · novelty 7.0

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

Sink or SWIM: Tackling Real-Time ASR at Scale

cs.SD · 2026-01-22 · unverdicted · novelty 6.0

SWIM scales Whisper ASR to 20 concurrent multilingual clients via buffer merging, achieving ~2.4s delay at 5 clients versus 3.4s for single-client baselines while preserving accuracy.

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

cs.CL · 2024-02-20 · conditional · novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred-response likelihoods; it outperforms standard DPO on diverse datasets and MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

eess.AS · 2019-07-04 · unverdicted · novelty 6.0

The method decouples prosody alignment using pre-computed phoneme timestamps and adds a VAE to achieve robust, fine-grained prosody transfer from unseen speakers in single-speaker neural TTS.

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

cs.CL · 2019-06-27 · unverdicted · novelty 6.0

Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

cs.CL · 2025-10-03 · conditional · novelty 4.0

Direct prompting scales more consistently than CoT prompting for speech-to-text translation as the amount of S2TT data increases.

Empowering Video Translation using Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

cs.SD · 2023-09-22 · unverdicted · novelty 3.0

The authors propose and test a data augmentation framework based on deepfake audio to improve training of speech-to-text transcription models.

citing papers explorer

Showing 12 of 12 citing papers.

Gauge-covariant stochastic neural fields: Stability and finite-width effects

hep-th · 2025-08-26 · unverdicted · none · ref 3 · internal anchor

A gauge-covariant stochastic neural field theory is introduced that derives the maximal Lyapunov exponent and amplification factor, showing finite-width effects as perturbative corrections to dressed kernels that leave the marginality condition unchanged for fixed kernel geometry.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · none · ref 14

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

Deep Learning Scaling is Predictable, Empirically

cs.LG · 2017-12-01 · unverdicted · none · ref 4

Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.

Mixed Precision Training

cs.AI · 2017-10-10 · accept · none · ref 8

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

Sink or SWIM: Tackling Real-Time ASR at Scale

cs.SD · 2026-01-22 · unverdicted · none · ref 9 · internal anchor

SWIM scales Whisper ASR to 20 concurrent multilingual clients via buffer merging, achieving ~2.4s delay at 5 clients versus 3.4s for single-client baselines while preserving accuracy.

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

cs.CL · 2024-02-20 · conditional · none · ref 239 · internal anchor

DPOP is a new loss function that prevents DPO from lowering preferred-response likelihoods; it outperforms standard DPO on diverse datasets and MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

eess.AS · 2019-07-04 · unverdicted · none · ref 28 · internal anchor

The method decouples prosody alignment using pre-computed phoneme timestamps and adds a VAE to achieve robust, fine-grained prosody transfer from unseen speakers in single-speaker neural TTS.

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

cs.CL · 2019-06-27 · unverdicted · none · ref 16 · internal anchor

Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

cs.CV · 2026-04-09 · unverdicted · none · ref 21

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

cs.CL · 2025-10-03 · conditional · none · ref 15 · internal anchor

Direct prompting scales more consistently than CoT prompting for speech-to-text translation as the amount of S2TT data increases.

Empowering Video Translation using Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · none · ref 152

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

cs.SD · 2023-09-22 · unverdicted · none · ref 10 · internal anchor

The authors propose and test a data augmentation framework based on deepfake audio to improve training of speech-to-text transcription models.
