Dual-reference benchmarking on atypical stuttered speech reveals disparities in ASR model performance and rankings between verbatim and intended transcriptions.
What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR
abstract
ASR systems have often been reported to underperform on atypical speech. A frequently overlooked complicating factor is that atypical speech admits two valid transcription references, depending on context and use case: verbatim (the speech as actually produced, including repetitions and prolongations) and intended (the canonical form of the text with disfluencies removed). Most ASR evaluations collapse this duality into a single ground truth and reward systems that delete disfluencies, ignoring verbatim faithfulness. We benchmark 11 ASR models from the encoder-decoder, CTC, and transducer families against both verbatim and intended references, using atypical stuttered speech as a case study. Our quantitative assessment shows that model performance and model rankings differ between the two transcript styles. Through this analysis, we highlight the importance of selecting a transcription reference suited to the use case for valid model selection, particularly in atypical ASR.
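The effect of the reference choice can be illustrated with a small sketch. It assumes word error rate (WER) as the metric, which the abstract does not name, and it uses a made-up utterance and two hypothetical model outputs rather than data or systems from the paper. The helper names `word_errors` and `wer` are likewise illustrative.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_ref


# Made-up utterance with word repetitions (illustrative only).
verbatim_ref = "i i i want want to go"
intended_ref = "i want to go"

# Two hypothetical system outputs: one drops disfluencies, one keeps them.
hyp_fluent = "i want to go"
hyp_faithful = "i i i want want to go"

scores = {
    name: {
        "verbatim": wer(verbatim_ref, hyp),
        "intended": wer(intended_ref, hyp),
    }
    for name, hyp in [("fluent", hyp_fluent), ("faithful", hyp_faithful)]
}

for name, result in scores.items():
    print(name, result)
```

In this toy case the output that removes disfluencies is perfect against the intended reference but incurs three deletions against the verbatim one, while the output that keeps them shows the opposite pattern, so the ranking of the two flips with the reference.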
fields
cs.CL
years
2026