Evaluation of Automatic Speech Recognition Using Generative Large Language Models
Pith reviewed 2026-05-09 21:20 UTC · model grok-4.3
The pith
Large language models select the better automatic speech recognition hypothesis with 92-94 percent agreement with human annotators, compared to 63 percent for word error rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decoder-based large language models achieve 92-94 percent agreement with human annotators when selecting the semantically preferable hypothesis between two ASR candidates on the HATS dataset, exceeding the 63 percent agreement obtained with word error rate and outperforming embedding-based semantic metrics. Embeddings extracted from these same models yield semantic distance measures comparable to those from encoder architectures. The approach also supports qualitative classification of error categories, indicating a path toward evaluation that is both semantic and human-interpretable.
What carries the argument
Pairwise hypothesis selection, in which a prompted decoder-based large language model identifies which of two ASR transcriptions better preserves meaning.
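A minimal sketch of how such a pairwise selection could be scored against human labels. The prompt wording, the use of a reference transcript in the prompt, and the names `build_prompt`, `parse_choice`, and `agreement` are assumptions made for illustration; the paper's actual template and model call are not reproduced here.

```python
from typing import Sequence

# Hypothetical prompt wording, not the template used in the paper.
PROMPT_TEMPLATE = (
    "Reference: {ref}\n"
    "Hypothesis A: {a}\n"
    "Hypothesis B: {b}\n"
    "Which hypothesis better preserves the meaning of the reference? "
    "Answer with A or B."
)


def build_prompt(ref: str, hyp_a: str, hyp_b: str) -> str:
    """Fill the (assumed) template with one reference and two ASR candidates."""
    return PROMPT_TEMPLATE.format(ref=ref, a=hyp_a, b=hyp_b)


def parse_choice(reply: str) -> str:
    """Accept only a bare 'A' or 'B' (case-insensitive, optional trailing period)."""
    cleaned = reply.strip().rstrip(".").strip().upper()
    if cleaned not in ("A", "B"):
        raise ValueError(f"Unparseable model reply: {reply!r}")
    return cleaned


def agreement(model_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Fraction of pairs where the model picks the same hypothesis as the humans."""
    if not human_choices or len(model_choices) != len(human_choices):
        raise ValueError("Need two non-empty label lists of equal length.")
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)


if __name__ == "__main__":
    # Replies would come from an LLM; these strings are placeholders.
    replies = ["A", "b.", "A", "A"]
    model = [parse_choice(r) for r in replies]
    human = ["A", "B", "B", "A"]
    print(agreement(model, human))
```

Rejecting any reply other than a bare A or B is a deliberately strict choice: it avoids guessing a label from free text, at the cost of needing a retry policy in practice.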
If this is right
- ASR evaluation can prioritize semantic fidelity over exact word matches when selecting or ranking hypotheses.
- Generative embeddings from decoder models become a practical substitute for encoder embeddings in semantic distance calculations.
- LLM prompts enable direct qualitative breakdown of error types in ASR outputs without additional specialized tools.
- Semantic evaluation pipelines can reduce dependence on word error rate alone for system comparisons.
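The embedding-based semantic distance mentioned in the list above can be sketched as a cosine distance between a reference embedding and each hypothesis embedding. This is only an illustration: the vectors below are made-up placeholders rather than outputs of any model, and `cosine_distance` and `pick_closer` are hypothetical helper names.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same dimension.")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("Cosine distance is undefined for a zero vector.")
    return 1.0 - dot / (norm_u * norm_v)


def pick_closer(ref_vec: Sequence[float],
                vec_a: Sequence[float],
                vec_b: Sequence[float]) -> str:
    """Choose the hypothesis whose embedding is closer to the reference (ties go to A)."""
    dist_a = cosine_distance(ref_vec, vec_a)
    dist_b = cosine_distance(ref_vec, vec_b)
    return "A" if dist_a <= dist_b else "B"


if __name__ == "__main__":
    # Placeholder 2-d vectors standing in for sentence embeddings.
    reference = [1.0, 0.0]
    hypothesis_a = [0.9, 0.1]
    hypothesis_b = [0.0, 1.0]
    print(pick_closer(reference, hypothesis_a, hypothesis_b))
```

The same function applies whether the vectors come from an encoder or a decoder model; only the embedding extraction step, which is not shown here, differs.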
Where Pith is reading between the lines
- The same selection method could be tested on other sequence tasks such as machine translation or summarization where surface metrics also miss meaning.
- Production ASR systems might incorporate LLM-based scoring to flag outputs that humans would judge as semantically flawed even when word error rate is low.
- Scaling the approach to rank more than two hypotheses at once would require checking whether agreement with humans remains high.
Load-bearing premise
Agreement with human annotators on which of two ASR hypotheses is semantically superior serves as a sufficient and unbiased proxy for overall correctness across domains.
What would settle it
A controlled test on a fresh ASR dataset with independent human ratings against known ground-truth transcriptions, where LLMs frequently select the wrong hypothesis while word error rate aligns more closely with the actual errors.
Original abstract
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
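To make the abstract's point about WER concrete, here is a small sketch of WER as word-level edit distance divided by reference length. The example sentences are invented for illustration and do not come from HATS; they show two hypotheses with identical WER, one of which reverses the meaning.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("Reference must contain at least one word.")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the patient has no fever"
    meaning_reversed = "the patient has a fever"   # one substitution
    meaning_kept = "patient has no fever"          # one deletion
    print(wer(reference, meaning_reversed), wer(reference, meaning_kept))
```

Both hypotheses score 0.2, so WER alone cannot say which one a human would prefer.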
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates decoder-based LLMs for semantic ASR evaluation via three methods: pairwise hypothesis selection, generative embedding distances, and qualitative error classification. On the HATS dataset, LLMs achieve 92-94% agreement with human annotators on hypothesis selection (vs. 63% for WER and lower for other semantic metrics), with decoder embeddings performing comparably to encoder models, positioning LLMs as a promising interpretable alternative to WER.
Significance. If the empirical results hold under proper controls, the work could meaningfully advance ASR evaluation by demonstrating that LLMs capture semantic fidelity better than surface metrics in at least one setting, with potential for more interpretable error analysis. The absence of cross-domain replication and downstream validation, however, confines the immediate significance to a narrow empirical observation rather than a general methodological advance.
major comments (3)
- [Abstract / Results] Abstract and results section: The headline 92-94% human agreement for hypothesis selection is reported only on the HATS dataset with no accompanying inter-annotator agreement statistics, model-size details, prompting specifications, or statistical significance tests; this directly undermines the claim that LLMs supply a robust semantic proxy.
- [Discussion / Conclusion] Discussion / Conclusion: The inference that LLM-based selection is a superior semantic proxy rests on HATS-specific pairwise judgments without cross-domain replication, without testing whether LLM-chosen hypotheses improve downstream semantic metrics in a real ASR pipeline, and without evidence that human pairwise labels on HATS are unbiased across error distributions or domains.
- [Methods] Methods: The three evaluation approaches are described at a high level but lack concrete implementation details (e.g., exact prompting templates for selection, how generative embeddings are extracted and normalized, or the error taxonomy used for qualitative classification), preventing assessment of reproducibility.
minor comments (2)
- [Abstract] The abstract states that decoder embeddings are 'comparable' to encoder models but supplies no numerical values or tables for direct comparison.
- [Introduction / Methods] Notation for the three approaches could be introduced with explicit labels or equations to improve clarity when referring back to them in the results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, indicating where revisions have been made to the manuscript.
- Referee: [Abstract / Results] Abstract and results section: The headline 92-94% human agreement for hypothesis selection is reported only on the HATS dataset with no accompanying inter-annotator agreement statistics, model-size details, prompting specifications, or statistical significance tests; this directly undermines the claim that LLMs supply a robust semantic proxy.
  Authors: We agree that the abstract and results would be strengthened by these details. In the revised manuscript we have added inter-annotator agreement statistics for the HATS annotations, specified the model sizes and variants evaluated, included the exact prompting templates in a new appendix, and reported statistical significance tests (McNemar) against WER and other baselines. These additions support the robustness of the reported agreement rates.
  Revision: yes
- Referee: [Discussion / Conclusion] Discussion / Conclusion: The inference that LLM-based selection is a superior semantic proxy rests on HATS-specific pairwise judgments without cross-domain replication, without testing whether LLM-chosen hypotheses improve downstream semantic metrics in a real ASR pipeline, and without evidence that human pairwise labels on HATS are unbiased across error distributions or domains.
  Authors: We concur that the evaluation is limited to HATS and lacks cross-domain replication or downstream pipeline validation. The revised discussion now explicitly acknowledges these limitations, discusses possible biases in the HATS human labels, and outlines future work. However, new cross-domain experiments and downstream evaluations cannot be performed within the scope of this revision.
  Revision: partial
- Referee: [Methods] Methods: The three evaluation approaches are described at a high level but lack concrete implementation details (e.g., exact prompting templates for selection, how generative embeddings are extracted and normalized, or the error taxonomy used for qualitative classification), preventing assessment of reproducibility.
  Authors: We have revised the methods section to supply the missing details: exact prompting templates appear in Appendix A, the generative embedding extraction and normalization procedure (including layer selection and cosine similarity) is now fully specified, and the error taxonomy with definitions and examples is provided in Section 3.3. These changes improve reproducibility.
  Revision: yes
- Not resolved by the rebuttal: cross-domain replication and downstream ASR pipeline validation, which were not performed in the original study.
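The McNemar test named in the first response compares two metrics on the same set of hypothesis pairs, using only the pairs where exactly one metric agrees with the human choice. The sketch below uses the exact binomial form of the test, which is an assumption, since the rebuttal does not say which variant was used. The function names and the example flags are hypothetical and are not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(correct_1: Sequence[bool],
                      correct_2: Sequence[bool]) -> Tuple[int, int]:
    """Count items where only metric 1 (b) or only metric 2 (c) matches the humans."""
    if len(correct_1) != len(correct_2):
        raise ValueError("Both metrics must be scored on the same items.")
    b = sum(1 for x, y in zip(correct_1, correct_2) if x and not y)
    c = sum(1 for x, y in zip(correct_1, correct_2) if y and not x)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair favours either metric with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Placeholder flags: True means the metric agreed with the human choice.
    llm_ok = [True] * 10 + [True, False]
    wer_ok = [False] * 10 + [True, False]
    b, c = discordant_counts(llm_ok, wer_ok)
    print(b, c, mcnemar_exact(b, c))
```

Items where both metrics agree or both disagree with the humans carry no information about which metric is better, which is why only `b` and `c` enter the p-value.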
Circularity Check
No circularity: empirical comparison to external human labels
The paper reports experimental results on LLM agreement with human annotators for ASR hypothesis selection on the HATS dataset, directly comparing percentages (92-94% vs 63% WER) without any equations, derivations, fitted parameters renamed as predictions, or self-referential definitions. All load-bearing claims rest on external human judgments and standard metrics, with no reduction of outputs to inputs by construction. This is a standard empirical evaluation and self-contained against the provided benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: human annotators provide reliable ground-truth judgments of semantic correctness for ASR hypotheses.