Fine-tuning Whisper on Swiss German speech with subtitle supervision yields an honest 25.6% WER baseline (13.8% cWER) and demonstrates that prior SOTA claims of 17% WER result from benchmark contamination allowing 13.88% WER with no dialect training.
Evaluation of llms in speech is often flawed: Test set contamination in large language models for speech recognition
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
dataset 1
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
dataset 1polarities
use dataset 1representative citing papers
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
citing papers explorer
-
Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)
Fine-tuning Whisper on Swiss German speech with subtitle supervision yields an honest 25.6% WER baseline (13.8% cWER) and demonstrates that prior SOTA claims of 17% WER result from benchmark contamination allowing 13.88% WER with no dialect training.