wav2vec 2.0: A framework for self-supervised learning of speech representations
8 Pith papers cite this work.
citation summary
years: 2026 (8)
roles: method (1)
polarities: use method (1)
citing papers explorer
- Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
  LLM generative error correction improves low-resource Frisian ASR performance, with comparable gains on a contamination-controlled offline dataset confirming true correction ability.
- Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
  A joint fullband-subband model using high-resolution 44.1 kHz audio outperforms standard 16 kHz detectors for singing voice deepfake detection by exploiting spectrum-specific synthesis artifacts.
- APEX: Audio Prototype EXplanations for Classification Tasks
  APEX generates four types of prototype-based explanations for pre-trained audio classifiers that preserve output invariance and target acoustic properties better than gradient methods applied to spectrograms.
- Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
  Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
- Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection
  Speech-based depression detection models primarily learn speaker identity rather than depression biomarkers, with performance dropping sharply on unseen speakers even under adversarial training.
- Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
  Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.
- IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA)
  The IQRA 2026 challenge on Arabic mispronunciation detection reports a 0.28 F1-score gain from new authentic human error data and diverse modeling approaches including self-supervised and audio-language models.
- Contrastive Regularization for Accent-Robust ASR
  Supervised contrastive learning as an auxiliary loss during CTC fine-tuning improves accent robustness in ASR, yielding up to 29% relative WER reduction on unseen accents.