Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford , Jong Wook Kim , Tao Xu , Greg Brockman , Christine McLeavey , Ilya Sutskever

classification eess.AS cs.CL cs.LG cs.SD

keywords models speech processing robust supervision when accuracy amounts

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
cs.AI 2026-05 unverdicted novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Mechanistic Interpretability of ASR models using Sparse Autoencoders
cs.CL 2026-05 unverdicted novelty 7.0

Sparse autoencoders applied to Whisper ASR reveal monosemantic features across linguistic boundaries and demonstrate cross-lingual feature steering.
Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
eess.AS 2026-05 unverdicted novelty 7.0

Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL 2026-04 accept novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Tadabur: A Large-Scale Quran Audio Dataset
cs.SD 2026-04 unverdicted novelty 7.0

Tadabur is a large-scale Quran audio dataset with over 1400 hours from 600+ reciters to support speech research and benchmarks.
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
cs.CV 2026-04 unverdicted novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
cs.SD 2026-04 unverdicted novelty 7.0

SignRecGAN trains on separate sign and speech datasets via adversarial and reconstruction objectives to inject sign-derived prosody into TTS output using the S2PFormer model.
In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
cs.CL 2026-04 unverdicted novelty 7.0

Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
A Semi-Supervised Framework for Speech Confidence Detection using Whisper
cs.SD 2026-05 unverdicted novelty 6.0

A hybrid semi-supervised framework fusing Whisper embeddings with acoustic and prosodic features achieves 0.751 Macro-F1 for speaker confidence detection and outperforms baselines including WavLM, HuBERT, and Wav2Vec 2.0.
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
cs.SD 2026-05 unverdicted novelty 6.0

STRUM is a multi-stage neural audio-to-chart system that achieves F1 scores of 0.838 (drums), 0.694 (bass), 0.651 (guitar), and 0.539 (vocals) on a 30-song benchmark with released code and models.
Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
cs.CL 2026-05 unverdicted novelty 6.0

A consequence-aware evaluation framework applied to LLMs in ATC finds a peak Risk Score of only 0.69 despite high macro-F1, with errors concentrated in high-impact entities.
Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
cs.LG 2026-05 unverdicted novelty 6.0

Imagined speech can be decoded from MEG by mapping imagined brain responses to listened ones and applying a word decoder trained only on listened data, yielding significant above-chance decoding for held-out subjects.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
cs.AI 2026-05 unverdicted novelty 6.0

Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
cs.CV 2026-04 unverdicted novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
BlasBench: An Open Benchmark for Irish Speech Recognition
cs.CL 2026-04 conditional novelty 6.0

BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
cs.CV 2026-04 unverdicted novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
eess.AS 2026-04 unverdicted novelty 6.0

TASU2 adds controllability over uncertainty and error rate to text-derived CTC simulation, enabling better cross-modal alignment and low-resource adaptation for speech LLMs than prior text-only or TTS methods.
Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
cs.SD 2026-03 accept novelty 6.0

RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.
Predicting Psychological Well-Being from Spontaneous Speech using LLMs
cs.CL 2026-05 unverdicted novelty 5.0

LLMs achieve Spearman correlations up to 0.8 for zero-shot Ryff PWB prediction from spontaneous speech, with added statistical and linguistic explainability analyses.
AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
cs.CV 2026-05 unverdicted novelty 5.0

AllocMV uses a global planner to build a structured persistent state then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio un...
Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing
cs.SD 2026-04 unverdicted novelty 5.0

Few-shot TTS adaptation combined with LLM-guided phoneme editing produces synthetic accented speech that improves ASR word error rates on real accented audio even in cross-speaker and ultra-low-data settings.
WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
cs.CL 2026-04 unverdicted novelty 5.0

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
cs.AI 2026-04 unverdicted novelty 5.0

A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
cs.CL 2026-04 unverdicted novelty 5.0

LLMs annotated 100 YouTube transcripts on cow urine health claims using a 14-category taxonomy, revealing that promoters rely on efficacy appeals and social proof while debunkers emphasize authority and rebuttal.
When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
cs.CL 2026-04 unverdicted novelty 5.0

LLMs fail to reliably detect culturally embedded health misinformation on YouTube because promotional and debunking content share similar rhetorical registers that blend tradition with pseudo-science, and this limitat...
Migrant Voices, Local News: Insights on Bridging Community Needs with Media Content
cs.CL 2026-04 unverdicted novelty 5.0

Focus groups reveal topic gaps and readability barriers in local news for migrants, uncovered by applying standard NLP tools to 2000+ hyper-local articles.
Towards General Text Embeddings with Multi-stage Contrastive Learning
cs.CL 2023-08 unverdicted novelty 5.0

GTE_base is a compact text embedding model trained with multi-stage contrastive learning on diverse data; it outperforms OpenAI's embedding API and 10x larger models on massive benchmarks, and it handles code by treating it as text.
Voice Biomarkers for Depression and Anxiety
cs.LG 2026-05 unverdicted novelty 4.0

Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
Detecting Alarming Student Verbal Responses using Text and Audio Classifier
cs.CL 2026-04 unverdicted novelty 4.0

A hybrid text-plus-audio classifier framework is introduced to identify potentially troubling student responses by analyzing both what is said and how it is said.
Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
cs.AI 2026-04 unverdicted novelty 4.0

A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.
From Multimodal Signals to Adaptive XR Experiences for De-escalation Training
cs.HC 2026-04 unverdicted novelty 4.0

An early multimodal XR prototype fuses five signal streams with an interpretation layer to detect escalation cues and enable adaptive de-escalation training.
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering
cs.CE 2026-04 unverdicted novelty 3.0

A modular XR platform integrates Whisper, NLLB, AWS Polly, RoBERTa, flan-t5, and MediaPipe to deliver real-time multilingual and International Sign support for education, with benchmarks showing AWS Polly's low latenc...