Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford; Christine McLeavey; Greg Brockman; Ilya Sutskever; Jong Wook Kim; Tao Xu

arxiv: 2212.04356 · v1 · submitted 2022-12-06 · 📡 eess.AS · cs.CL · cs.LG · cs.SD

Pith reviewed 2026-05-24 10:11 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.LG · cs.SD

keywords speech recognition · weak supervision · zero-shot transfer · multilingual audio · large-scale training · robustness · audio transcription

The pith

Speech models trained on 680,000 hours of web audio paired with transcripts achieve competitive zero-shot recognition on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether predicting transcripts from large volumes of internet audio can produce capable speech systems. Scaling this approach to 680,000 hours of multilingual and multitask data yields models that transfer directly to held-out benchmarks without fine-tuning. These models reach accuracy levels often matching prior supervised systems and approach human performance on robustness measures. The work releases the models and code as a base for additional speech processing research.

Core claim

When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness.

What carries the argument

Large-scale weak supervision: the model is trained to predict the transcript of each web-scraped audio-transcript pair, and that prediction task supplies the training signal for generalization.

If this is right

A single model can handle multiple languages and tasks through the same training process.
Zero-shot evaluation on new benchmarks becomes feasible without task-specific retraining.
Model releases enable downstream work on robust speech systems starting from the web-trained weights.
Robustness metrics approach human levels when scale is applied to diverse web data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same data scaling pattern could reduce the need for curated labels in related sequence tasks.
Extending the approach to noisier or domain-shifted audio would test how far web data alone carries performance.
Combining the trained models with text-only systems might produce unified handling of speech and language inputs.

Load-bearing premise

The web-scraped audio-transcript pairs contain sufficient signal and diversity that a single model trained on them will generalize to held-out clean benchmarks without domain-specific filtering or post-hoc data selection.

What would settle it

If a model trained on the full 680,000 hours shows accuracy on clean benchmarks that falls substantially below established supervised baselines, the generalization claim would not hold.
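
One way to make "substantially below" concrete is a paired bootstrap over utterances. The sketch below is illustrative and is not the paper's procedure: it assumes per-utterance word-error counts and reference lengths are already available for both systems, and the name `bootstrap_wer_gap`, the 1,000 resamples, and the 95% percentile interval are all choices made for this example.

```python
import random

def aggregate_wer(errors, ref_lens):
    """Corpus-level WER: total word errors divided by total reference words."""
    return sum(errors) / sum(ref_lens)

def bootstrap_wer_gap(errs_a, errs_b, ref_lens, n_resamples=1000, seed=0):
    """Paired bootstrap interval for WER(A) - WER(B) on the same utterances.

    errs_a, errs_b: per-utterance word-error counts for systems A and B.
    ref_lens: per-utterance reference word counts (each assumed > 0).
    Returns (point_estimate, lower, upper) for a 95% percentile interval.
    """
    rng = random.Random(seed)
    n = len(ref_lens)
    gaps = []
    for _ in range(n_resamples):
        # Resample utterance indices with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        lens = [ref_lens[i] for i in idx]
        wer_a = aggregate_wer([errs_a[i] for i in idx], lens)
        wer_b = aggregate_wer([errs_b[i] for i in idx], lens)
        gaps.append(wer_a - wer_b)
    gaps.sort()
    lower = gaps[(25 * n_resamples) // 1000]
    upper = gaps[(975 * n_resamples) // 1000 - 1]
    point = aggregate_wer(errs_a, ref_lens) - aggregate_wer(errs_b, ref_lens)
    return point, lower, upper
```

If the whole interval for WER(zero-shot) minus WER(supervised) sits well above zero on a clean benchmark, that would count against the generalization claim; an interval straddling zero would not.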

Figures

Figures reproduced from arXiv: 2212.04356 by Alec Radford, Christine McLeavey, Greg Brockman, Ilya Sutskever, Jong Wook Kim, Tao Xu.

**Figure 1.** Overview of our approach. A sequence-to-sequence Transformer model is trained on many different speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of …

**Figure 2.** Zero-shot Whisper models close the gap to human robustness. Despite matching or outperforming a human on LibriSpeech dev-clean, supervised LibriSpeech models make roughly twice as many errors as a human on other datasets demonstrating their brittleness and lack of robustness. The estimated robustness frontier of zero-shot Whisper models, however, includes the 95% confidence interval for this particular hu…

**Figure 4.** Correlation of pre-training supervision amount with downstream translation performance. The amount of pretraining translation data for a given language is only moderately predictive of Whisper’s zero-shot performance on that language in Fleurs.

**Figure 5.** WER on LibriSpeech test-clean as a function of SNR under additive white noise (left) and pub noise (right). The accuracy of LibriSpeech-trained models degrade faster than the best Whisper model (F). NVIDIA STT models (•) perform best under low noise but are outperformed by Whisper under high noise (SNR < 10 dB). The second-best model under low noise (H) is fine-tuned on LibriSpeech only and degrades even m…

**Figure 6.** Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription. The distribution of word error rates from six ASR systems on seven long-form datasets are compared, where the input lengths range from a few minutes to a few hours. The boxes show the quartiles of per-example WERs, and the per-dataset aggregate WERs are annotated on each box. Our model outperform…

**Figure 7.** Whisper’s performance is close to that of professional human transcribers. This plot shows the WER distributions of 25 recordings from the Kincaid46 dataset transcribed by Whisper, the same 4 commercial ASR systems from …

**Figure 8.** Zero-shot Whisper performance scales reliably across tasks and languages with increasing model size. Lightly shaded lines represent individual datasets or languages, showing that performance is more varied than the smooth trends in aggregate performance. Large V2 distinguished with a dashed orange line since it includes several changes that are not present for the smaller models in this analysis. Dataset E…

**Figure 10.** We visualize the differences. On most datasets the two normalizers perform similarly, without significant differences in WER reduction between Whisper and compared open-source models, while on some datasets, namely WSJ, CallHome, and Switchboard, our normalizer reduces the WER of Whisper models’ significantly more. The differences in reduction can be traced down to different formats used by the ground…

**Figure 11.** Training dataset statistics
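
Several of the captions above report word error rate (WER), and the Figure 10 text turns on how transcripts are normalized before scoring. As a rough illustration of why the normalizer matters, the sketch below computes WER from a word-level edit distance after a deliberately small normalizer. The normalizer is an assumption for this example (lowercasing, removing commas between digits, stripping other punctuation, dropping a few filler words); it is not the normalizer released with the paper, and the function names are hypothetical.

```python
import re

# A small, illustrative filler list; real normalizers cover far more cases.
FILLERS = {"hmm", "mm", "mhm", "mmm", "uh", "um"}

def normalize(text):
    """Toy normalizer: lowercase, join digit groups, strip punctuation, drop fillers."""
    text = text.lower()
    text = re.sub(r"(?<=\d),(?=\d)", "", text)  # 1,000 -> 1000
    text = re.sub(r"[^\w\s']", " ", text)       # keep word characters and apostrophes
    return [w for w in text.split() if w not in FILLERS]

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """WER of one hypothesis against one reference, after normalization."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)
```

With this normalizer, `wer("Um, it costs 1,000 dollars.", "it costs 1000 dollars")` is 0, whereas scoring the raw strings would count formatting differences as errors. The per-example WERs of Figure 6 correspond to calling `wer` on each recording; an aggregate WER instead divides summed edit distances by summed reference lengths.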

Original abstract

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that 680k hours of web-scraped weak supervision produces zero-shot ASR models competitive with supervised baselines on public benchmarks, with the release making the result directly checkable.

The core result is that a single model trained on 680k hours of multilingual web audio with transcripts generalizes to standard ASR benchmarks without any fine-tuning and often matches prior supervised numbers. What is new relative to earlier weak-supervision work in speech is the scale combined with the multitask, multilingual setup. Releasing the models and inference code is the practical part: it lets others verify the numbers on the usual test sets. The empirical tables line up with the claim, and there is no circularity because the evaluation sets were never used in training or tuning.

The main soft spot is that the filtering rules for the web data are described only at a high level, so it is hard to know how much domain selection went into the training distribution. That said, the released models let anyone test whether the generalization holds on new data, which reduces the practical impact of the missing details. Minor gaps in the write-up on data provenance do not touch the central scaling observation.

This is for groups working on speech recognition, low-resource languages, or data-efficient training; anyone who needs a strong starting point for robust ASR will get immediate value from the artifacts. It is worth sending to referees because the benchmark numbers can be checked directly with the released models and inference code, and because the result changes the cost picture for building these systems.

Referee Report

0 major / 3 minor

Summary. The manuscript presents an empirical study of speech recognition models trained via large-scale weak supervision on 680,000 hours of multilingual and multitask web-scraped audio-transcript pairs. The central claim is that scaling this supervision produces models that achieve strong zero-shot generalization to standard public benchmarks, often matching or approaching prior fully supervised results without any fine-tuning, while also approaching human-level accuracy and robustness; models and inference code are released.

Significance. If the reported zero-shot results hold, the work establishes that weak supervision at this scale can yield robust speech systems competitive with supervised baselines, reducing reliance on task-specific labeled data. The direct verifiability of results on external benchmarks and the public release of models constitute clear strengths that support further research on scaling in speech processing.

minor comments (3)

[Abstract] Abstract and §2 (data collection): limited detail is provided on the precise filtering rules and selection criteria applied to the web-scraped pairs; while not required to verify the benchmark numbers, this reduces the ability to diagnose potential domain biases.
[Table 1] Table 1 and §4 (benchmark results): the zero-shot vs. supervised comparisons would benefit from explicit reporting of the number of runs or confidence intervals to quantify variability.
[§3] §3 (model architecture): the multitask training objective is described at a high level; a short equation or pseudocode would clarify how transcription, translation, and language identification losses are combined.
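
On the third comment, the Figure 1 caption says every task is expressed as a sequence of tokens predicted by the decoder. A hedged sketch of what such a target sequence could look like follows. The special-token spellings are modeled on the released tokenizer but should be treated as illustrative, `build_target` is a hypothetical helper, and whitespace splitting stands in for real subword tokenization. Under this reading there would be no weighted sum of per-task losses: a single next-token cross-entropy over the whole sequence would cover language identification, task selection, and text alike. That is an assumption about the objective, not something this report confirms.

```python
def build_target(text, language="en", task="transcribe",
                 timestamps=False, has_speech=True):
    """Assemble an illustrative decoder target (as string tokens) for one example."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>"]
    if not has_speech:
        # Voice activity detection: the target is just a no-speech marker.
        return tokens + ["<|nospeech|>", "<|endoftext|>"]
    tokens.append(f"<|{language}|>")  # spoken language identification
    tokens.append(f"<|{task}|>")      # transcription vs. translation
    if not timestamps:
        # Timestamp tokens themselves are omitted from this sketch.
        tokens.append("<|notimestamps|>")
    tokens.extend(text.split())       # stand-in for subword tokens
    tokens.append("<|endoftext|>")
    return tokens
```

For example, `build_target("hola", language="es", task="translate")` yields a start token, `<|es|>`, `<|translate|>`, `<|notimestamps|>`, the text, and an end token, so switching tasks changes only the prefix the decoder is conditioned on.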

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, recognition of the work's significance, and recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity; purely empirical scaling study on external benchmarks

The paper reports results from training models on 680k hours of web-scraped weak supervision and evaluates zero-shot transfer on standard held-out benchmarks (e.g., LibriSpeech, Common Voice) that were never part of training or hyperparameter selection. No mathematical derivation chain exists; there are no equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes. Central claims rest on direct, externally verifiable metrics rather than any self-referential reduction. Minor self-citations to prior OpenAI scaling work are present but not load-bearing for the reported generalization results. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that noisy web transcripts supply adequate training signal at scale; no free parameters are fitted to the reported benchmark numbers and no new entities are postulated.

axioms (1)

domain assumption: Internet audio-transcript pairs contain usable supervision despite label noise and domain mismatch.
Invoked to justify training directly on web data without manual curation or strong filtering.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours..."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "zero-shot transfer setting without the need for any fine-tuning"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "Figure 1. Overview of our approach... multitask training format"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
cs.AI 2026-05 unverdicted novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
cs.CL 2023-05 unverdicted novelty 8.0

ciwGAN and fiwGAN models trained on isolated words spontaneously generate concatenated multi-word outputs and display early compositionality precursors.
Mechanistic Interpretability of ASR models using Sparse Autoencoders
cs.CL 2026-05 unverdicted novelty 7.0

Sparse autoencoders applied to Whisper ASR reveal monosemantic features across linguistic boundaries and demonstrate cross-lingual feature steering.
Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
eess.AS 2026-05 unverdicted novelty 7.0

Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL 2026-04 accept novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Tadabur: A Large-Scale Quran Audio Dataset
cs.SD 2026-04 unverdicted novelty 7.0

Tadabur is a large-scale Quran audio dataset with over 1400 hours from 600+ reciters to support speech research and benchmarks.
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
cs.CV 2026-04 unverdicted novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
cs.SD 2026-04 unverdicted novelty 7.0

SignRecGAN trains on separate sign and speech datasets via adversarial and reconstruction objectives to inject sign-derived prosody into TTS output using the S2PFormer model.
In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
cs.CL 2026-04 unverdicted novelty 7.0

Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 7.0

FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
cs.SD 2026-02 unverdicted novelty 7.0

MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
cs.IR 2026-02 unverdicted novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme ac...
JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
cs.GR 2026-01 unverdicted novelty 7.0

JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
cs.CL 2025-02 unverdicted novelty 7.0

NUTSHELL is a new open dataset of ACL talks paired with abstracts, accompanied by baselines that demonstrate training benefits for speech-to-abstract generation while highlighting remaining challenges.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
DASB - Discrete Audio and Speech Benchmark
cs.SD 2024-06 unverdicted novelty 7.0

DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
Codec-Robust Attacks on Audio LLMs
cs.SD 2026-05 unverdicted novelty 6.0

CodecAttack optimizes perturbations in neural audio codec latent space to reach 85.5% average target-substring ASR on compressed Opus audio while waveform baselines stay below 26%.
JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR
cs.CL 2026-05 unverdicted novelty 6.0

JSPG jointly combines semantic, pinyin, and glyph retrieval with an extended Smith-Waterman algorithm to dynamically filter keyword dictionaries and improve accuracy in Chinese contextual ASR.
A Semi-Supervised Framework for Speech Confidence Detection using Whisper
cs.SD 2026-05 unverdicted novelty 6.0

A hybrid semi-supervised framework fusing Whisper embeddings with acoustic and prosodic features achieves 0.751 Macro-F1 for speaker confidence detection and outperforms baselines including WavLM, HuBERT, and Wav2Vec 2.0.
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
cs.SD 2026-05 unverdicted novelty 6.0

STRUM is a multi-stage neural audio-to-chart system that achieves F1 scores of 0.838 (drums), 0.694 (bass), 0.651 (guitar), and 0.539 (vocals) on a 30-song benchmark with released code and models.
Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
cs.CL 2026-05 unverdicted novelty 6.0

A consequence-aware evaluation framework applied to LLMs in ATC finds peak Risk Score of only 0.69 despite high macro-F1, with errors concentrated in high-impact entities.
Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
cs.LG 2026-05 unverdicted novelty 6.0

Imagined speech can be decoded from MEG by mapping imagined brain responses to listened ones and applying a word decoder trained only on listened data, yielding significant above-chance decoding for held-out subjects.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
cs.AI 2026-05 unverdicted novelty 6.0

Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
cs.CV 2026-04 unverdicted novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
BlasBench: An Open Benchmark for Irish Speech Recognition
cs.CL 2026-04 conditional novelty 6.0

BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
cs.CV 2026-04 unverdicted novelty 6.0

SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
eess.AS 2026-04 unverdicted novelty 6.0

TASU2 adds controllability over uncertainty and error rate to text-derived CTC simulation, enabling better cross-modal alignment and low-resource adaptation for speech LLMs than prior text-only or TTS methods.
Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
cs.SD 2026-03 accept novelty 6.0

RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.
Logics-Parsing-Omni Technical Report
cs.AI 2026-03 unverdicted novelty 6.0

Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec
cs.SD 2025-10 unverdicted novelty 6.0

FAC-FACodec is a controllable zero-shot foreign accent conversion framework using a factorized speech codec that adds an explicit parameter for adjusting pronunciation-level accent modification strength.
Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing
cs.LG 2025-07 conditional novelty 6.0

A multimodal machine learning framework fusing smartwatch audio and inertial sensing achieves macro F1 scores of 82% in lab and 77% in semi-naturalistic studies for detecting face-to-face conversations.
AudioPaLM: A Large Language Model That Can Speak and Listen
cs.CL 2023-06 unverdicted novelty 6.0

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
eess.AS 2026-05 unverdicted novelty 5.0

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
FormalASR: End-to-End Spoken Chinese to Formal Text
cs.CL 2026-05 unverdicted novelty 5.0

FormalASR fine-tunes small Qwen3-ASR models on new spoken-to-formal Chinese datasets to achieve direct transcription with up to 37.4% relative CER reduction over verbatim baselines.
CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering
cs.CV 2026-05 unverdicted novelty 5.0

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.
Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces
cs.LG 2026-05 unverdicted novelty 5.0

Symphony is a medical-grade speech recognition system that decomposes transcription into specialized components and outperforms existing systems in clinical settings while matching them in general domains.
Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces
cs.LG 2026-05 unverdicted novelty 5.0

Symphony is a modular medical speech recognition system that outperforms state-of-the-art models on clinical terminology tasks while matching them on general speech.
Predicting Psychological Well-Being from Spontaneous Speech using LLMs
cs.CL 2026-05 unverdicted novelty 5.0

LLMs achieve Spearman correlations up to 0.8 for zero-shot Ryff PWB prediction from spontaneous speech, with added statistical and linguistic explainability analyses.
AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
cs.CV 2026-05 unverdicted novelty 5.0

AllocMV uses a global planner to build a structured persistent state then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio un...
Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing
cs.SD 2026-04 unverdicted novelty 5.0

Few-shot TTS adaptation combined with LLM-guided phoneme editing produces synthetic accented speech that improves ASR word error rates on real accented audio even in cross-speaker and ultra-low-data settings.
WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
cs.CL 2026-04 unverdicted novelty 5.0

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
cs.AI 2026-04 unverdicted novelty 5.0

A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
cs.CL 2026-04 unverdicted novelty 5.0

LLMs annotated 100 YouTube transcripts on cow urine health claims using a 14-category taxonomy, revealing that promoters rely on efficacy appeals and social proof while debunkers emphasize authority and rebuttal.
When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
cs.CL 2026-04 unverdicted novelty 5.0

LLMs fail to reliably detect culturally embedded health misinformation on YouTube because promotional and debunking content share similar rhetorical registers that blend tradition with pseudo-science, and this limitat...
Migrant Voices, Local News: Insights on Bridging Community Needs with Media Content
cs.CL 2026-04 unverdicted novelty 5.0

Focus groups reveal topic gaps and readability barriers in local news for migrants, uncovered by applying standard NLP tools to 2000+ hyper-local articles.
Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation
cs.AI 2025-12 unverdicted novelty 5.0

Floorplan2Guide uses LLMs to parse floor plans into navigable graphs, reporting up to 92% accuracy on short routes with 5-shot prompting and 15% gains from graph structure over direct visual reasoning for BLV indoor n...
Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms
cs.IR 2025-12 unverdicted novelty 5.0

ELERAG integrates Wikidata entity linking with hybrid RRF re-ranking into RAG and outperforms baselines on a custom Italian academic dataset while cross-encoder methods win on the general SQuAD-it dataset.
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
cs.CV 2025-11 unverdicted novelty 5.0

AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
cs.DC 2025-10 unverdicted novelty 5.0

FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.
Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech
cs.CL 2025-07 unverdicted novelty 5.0

Balalaika is a data-centric annotation pipeline for Russian speech that combines semantic VAD, ASR ensembling, and prosody enrichment to build a 5.1k-hour corpus showing gains in denoising and TTS.
"I Said Things I Needed to Hear Myself": Peer Support as an Emotional, Organisational, and Sociotechnical Practice in Singapore
cs.HC 2025-06 unverdicted novelty 5.0

An interview study with 20 Singapore peer supporters maps their emotional, organisational, and sociocultural practices and derives design directions for culturally responsive digital tools and responsible AI augmentat...
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
Towards General Text Embeddings with Multi-stage Contrastive Learning
cs.CL 2023-08 unverdicted novelty 5.0

GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
Voice Biomarkers for Depression and Anxiety
cs.LG 2026-05 unverdicted novelty 4.0

Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
Detecting Alarming Student Verbal Responses using Text and Audio Classifier
cs.CL 2026-04 unverdicted novelty 4.0

A hybrid text-plus-audio classifier framework is introduced to identify potentially troubling student responses by analyzing both what is said and how it is said.
Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
cs.AI 2026-04 unverdicted novelty 4.0

A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.
From Multimodal Signals to Adaptive XR Experiences for De-escalation Training
cs.HC 2026-04 unverdicted novelty 4.0

An early multimodal XR prototype fuses five signal streams with an interpretation layer to detect escalation cues and enable adaptive de-escalation training.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 69 Pith papers

[1]

file number

Accessed: 2022-09-01. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and H...

work page doi:10.5281/zenodo.2591652 2022
[4]

Remove any of the following words: hmm, mm, mhm, mmm, uh, um

work page
[5]

Remove whitespace characters that come before an apostrophe (’)

work page
[6]

Convert standard or informal contracted forms of English into the original form

work page
[7]

Remove commas (,) between digits

work page
[8]

Remove periods (.) not followed by numbers

work page
[9]

Remove symbols as well as diacritics from the text, where symbols are the characters with the Unicode category starting with M, S, or P, except period, percent, and currency symbols that may be detected in the next step

work page
[10]

Detect any numeric expressions of numbers and currencies and replace with a form using Arabic numbers, e.g. “Ten thousand dollars” → “$10000”

work page
[11]

Convert British spellings into American spellings

work page
[12]

Remove remaining symbols that are not part of any numeric expressions

work page
[13]

Replace any successive whitespace characters with a space. A different, language-specific set of transformations would be needed to equivalently normalize non-English text, but due to our lack of linguistic knowledge to build such normalizers for all languages, we resort to the following basic standardization for non-English text:

work page
[14]

Remove any phrases between matching brackets ([, ])

work page
[15]

Remove any phrases between matching parentheses ((, ))

work page
[16]

Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each character in the NFKC-normalized string starts with M, S, or P

work page
[17]

Make the text lowercase

work page
[18]

Replace any successive whitespace characters with a space. Additionally, we put a space between every letter for the languages that do not use spaces to separate words, namely Chinese, Japanese, Thai, Lao, and Burmese, effectively measuring the character error rate instead. We note that the above is an imperfect solution, and it will sometimes produce uni...

work page 1988
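
The basic standardization for non-English text listed above (entries [14] to [18]) can be sketched in a few lines of Python. This is a minimal illustration that applies the steps in the order they are listed, not the authors' released normalizer. The function name `basic_normalize` is hypothetical, the final `strip()` of leading and trailing spaces is an added assumption, and the per-letter spacing for Chinese, Japanese, Thai, Lao, and Burmese is left out.

```python
import re
import unicodedata


def basic_normalize(text: str) -> str:
    """Sketch of the basic non-English standardization (hypothetical helper)."""
    # [14] Remove any phrases between matching brackets.
    text = re.sub(r"\[[^\]]*\]", "", text)
    # [15] Remove any phrases between matching parentheses.
    text = re.sub(r"\([^)]*\)", "", text)
    # [16] NFKC-normalize, then replace every character whose Unicode
    # category starts with M, S, or P (markers, symbols, punctuation)
    # with a space.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch for ch in text
    )
    # [17] Make the text lowercase.
    text = text.lower()
    # [18] Replace successive whitespace with a single space.
    # Trimming the ends is an assumption, not stated in the list.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(basic_normalize("Hello, [laughs] World! (applause)"))
```

Because NFKC composes a base letter and its combining accent into one precomposed character where one exists, most accented letters survive step [16]; only combining marks that have no precomposed form are replaced by a space.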
