arxiv: 2604.27533 · v1 · submitted 2026-04-30 · 💻 cs.CL

Recognition: unknown

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

Thibault Ba\~neras-Roux , Micka\"el Rouvier , Jane Wottawa , Richard Dufour

Authors on Pith no claims yet

Pith reviewed 2026-05-07 10:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic speech recognitionlanguage model rescoringerror metricsPOSEREmbERword error ratemorpho-syntactic evaluationsemantic evaluation

0 comments

The pith

POSER and EmbER metrics show how language model rescoring improves grammar and semantics in ASR beyond word error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces POSER, a part-of-speech error rate, and EmbER, an embedding-weighted error rate, to evaluate automatic speech recognition transcriptions. These metrics are applied to hypotheses before and after language model rescoring to isolate morpho-syntactic and semantic contributions that standard word error rate overlooks. A reader would care because WER treats all substitutions equally and misses whether errors affect sentence structure or meaning. The work demonstrates that rescoring yields measurable reductions in these finer-grained error types.

Core claim

The authors claim that POSER quantifies grammatical errors by comparing part-of-speech tags on erroneous words, while EmbER modifies the word error rate by weighting substitutions according to the semantic distance in embedding space. When these metrics are computed on rescored ASR outputs, they reveal the specific linguistic improvements delivered by language models that remain invisible to word error rate alone.

What carries the argument

POSER (part-of-speech error rate) and EmbER (embedding error rate), which respectively measure morpho-syntactic tag mismatches and semantically weighted word substitutions on transcription hypotheses.

If this is right

Rescoring with language models produces transcriptions with fewer part-of-speech errors, indicating better grammatical fidelity.
Wrong words after rescoring lie closer in embedding space to the reference, showing semantic quality gains.
The two metrics separate the linguistic effects of rescoring from raw lexical substitutions counted by WER.
ASR systems can be compared on targeted linguistic dimensions rather than a single aggregate score.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These metrics could be used to select language models optimized for particular downstream tasks that prize grammatical correctness or semantic fidelity.
The embedding component of EmbER might be replaced by task-specific distances to make the measure more sensitive to application needs.
The approach could be extended to measure error types such as named-entity accuracy or discourse coherence in rescored output.

Load-bearing premise

That reductions in POSER and EmbER after rescoring genuinely reflect useful linguistic gains rather than merely tracking the same improvements already captured by word error rate.

What would settle it

An experiment in which language model rescoring lowers word error rate yet leaves POSER and EmbER unchanged or higher, or in which human raters judge no improvement in grammar or meaning despite metric drops.

read the original abstract

Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not allow an in-depth analysis of automatic transcription errors. In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. In particular, we introduce two measures related to morpho-syntactic and semantic aspects of transcribed words: 1) the POSER (Part-of-speech Error Rate), which should highlight the grammatical aspects, and 2) the EmbER (Embedding Error Rate), a measurement that modifies the WER by providing a weighting according to the semantic distance of the wrongly transcribed words. These metrics illustrate the linguistic contributions of the language models that are applied during a posterior rescoring step on transcription hypotheses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes two new evaluation metrics for ASR systems to supplement WER: POSER (Part-of-speech Error Rate) to highlight morpho-syntactic/grammatical error aspects and EmbER (Embedding Error Rate) to weight transcription errors by semantic distance. These metrics are claimed to illustrate the specific linguistic contributions of language models applied during posterior rescoring of ASR hypotheses.

Significance. If POSER and EmbER were formally defined, implemented, and shown via experiments to isolate grammatical or semantic improvements from LM rescoring in a way not reducible to WER changes, the work would offer a useful qualitative lens for ASR evaluation. This could help practitioners better understand LM benefits beyond raw accuracy.

major comments (2)

Abstract: the claim that POSER and EmbER 'illustrate the linguistic contributions of the language models' is unsupported because the manuscript supplies neither explicit formulas, pseudocode, nor any implementation details for computing these metrics on ASR hypotheses.
Abstract and main text: no comparative results are presented (e.g., POSER/EmbER scores on the same hypotheses with vs. without LM rescoring, or correlation analysis with WER), so it is impossible to verify that the metrics capture LM-specific benefits or add information beyond WER.

minor comments (1)

The manuscript would be strengthened by adding a dedicated section with metric definitions, toy examples of POSER and EmbER calculation, and at least one small-scale experiment on a public ASR dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: Abstract: the claim that POSER and EmbER 'illustrate the linguistic contributions of the language models' is unsupported because the manuscript supplies neither explicit formulas, pseudocode, nor any implementation details for computing these metrics on ASR hypotheses.

Authors: We agree that the current presentation would benefit from greater formality. In the revised manuscript we will add explicit mathematical definitions for both POSER and EmbER, together with pseudocode that shows how each metric is computed from a hypothesis transcription, a reference transcription, and (for EmbER) pre-trained word embeddings. This will make the claimed linguistic contributions fully reproducible and directly verifiable. revision: yes
Referee: Abstract and main text: no comparative results are presented (e.g., POSER/EmbER scores on the same hypotheses with vs. without LM rescoring, or correlation analysis with WER), so it is impossible to verify that the metrics capture LM-specific benefits or add information beyond WER.

Authors: The manuscript already applies the metrics to rescored hypotheses, but we acknowledge that direct side-by-side comparisons and correlation analyses are missing. In the revision we will add tables and figures that report POSER and EmbER on identical hypothesis sets before and after LM rescoring, as well as Pearson and Spearman correlations between each new metric and WER. These additions will allow readers to assess whether the metrics isolate grammatical or semantic improvements that are not reducible to WER changes. revision: yes

Circularity Check

0 steps flagged

No circularity in metric proposal

full rationale

The manuscript proposes POSER and EmbER as new evaluation metrics for ASR rescoring effects and illustrates their intended linguistic sensitivity alongside WER. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text. The central claim is a direct proposal of measures rather than a reduction of any output to its own inputs by construction, making the work self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The claim rests on the assumption that the two proposed metrics validly measure the targeted linguistic aspects; no free parameters, standard axioms, or invented physical entities are involved.

invented entities (2)

POSER metric no independent evidence
purpose: Measure part-of-speech errors to highlight grammatical aspects of ASR transcriptions
Newly defined in the paper based on POS tagging of errors.
EmbER metric no independent evidence
purpose: Weight word errors by semantic distance using embeddings to capture meaning aspects
Newly defined in the paper as a modification of WER.

pith-pipeline@v0.9.0 · 5481 in / 1068 out tokens · 59172 ms · 2026-05-07T10:09:18.461581+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 5 canonical work pages · 4 internal anchors

[1]

Introduction Over the last years, various speech and language processing fields have made significant progress thanks to scientific and technological advances. Automatic Speech Recognition (ASR) has notably benefited from the massive increase in available data and the use of deep learning approaches [1, 2], making its models more robust and efficient [3]....
[2]

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

Description of proposed measures ASR systems are mainly evaluated through the WER. In this section, we first describe it (Section 2.1) in order to highlight its advantages and limitations. Then we detail the 6 complemen- tary automatic measures that we wish to apply to the evaluation of automatic transcriptions at the syntactic (Sections 2.2, 2.3 and 2.4)...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

We describe the data used for our qualitative analysis of language model rescor- ing in Section 3.1, the ASR system and the POS tagger in Sec- tions 3.2 and 3.3 respectively

Experimental protocol In this section, we present the experimental protocol set up to apply the different metrics listed in Section 2. We describe the data used for our qualitative analysis of language model rescor- ing in Section 3.1, the ASR system and the POS tagger in Sec- tions 3.2 and 3.3 respectively. Finally, we present the embed- dings used by th...
[4]

Experiments and Analysis This section presents firstly an analysis of the six applied met- rics presented in Section 2 in addition to the WER, and secondly a qualitative study of the impact of the language model rescor- ing process used in our ASR system. 4.1. Metrics analysis In order to make a more in-depth analysis of our metrics, in par- ticular to un...
[5]

We have chosen to verify their relevance by studying the impact of a posteriori hypothesis reordering on ASR systems using language models

Conclusions and Perspectives In this study, we applied different measures in addition to the WER metric to ASR systems in order to reveal different lin- guistic dimensions (grammatical, semantic, etc.) to transcrip- tion errors. We have chosen to verify their relevance by studying the impact of a posteriori hypothesis reordering on ASR systems using langu...
[6]

It was granted access to the HPC resources of IDRIS under the allocation 2021-A0111012991 made by GENCI

Acknowledgments This work was supported by the DIETS project financed by the Agence Nationale de la Recherche (ANR) under contract ANR- 20-CE23-0005. It was granted access to the HPC resources of IDRIS under the allocation 2021-A0111012991 made by GENCI

2021
[7]

New types of deep neural network learning for speech recognition and related applications: An overview,

L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” inIEEE International Conference On Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8599– 8603

2013
[8]

Deep speech 2: End-to-end speech recognition in english and mandarin,

D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Bat- tenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” inInternational conference on machine learning. PMLR, 2016, pp. 173–182

2016
[9]

wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020

2020
[10]

Impact of word er- ror rate on theme identification task of highly imperfect human– human conversations,

M. Morchid, R. Dufour, and G. Linar `es, “Impact of word er- ror rate on theme identification task of highly imperfect human– human conversations,”Computer Speech & Language, vol. 38, pp. 68–85, 2016

2016
[11]

Qualitative evaluation of asr adaptation in a lecture context: Application to the pastel corpus

S. Mdhaffar, Y . Est`eve, N. Hernandez, A. Laurent, R. Dufour, and S. Quiniou, “Qualitative evaluation of asr adaptation in a lecture context: Application to the pastel corpus.” inInterSpeech, 2019, pp. 569–573

2019
[12]

Transformer-based end-to-end speech recognition with local dense synthesizer attention,

M. Xu, S. Li, and X.-L. Zhang, “Transformer-based end-to-end speech recognition with local dense synthesizer attention,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5899–5903

2021
[13]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,”arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review arXiv 1904
[14]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language under- standing,”arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review arXiv 2018
[15]

Evaluating user perception of speech recognition system quality with semantic distance metric,

S. Kim, D. Le, W. Zheng, T. Singh, A. Arora, X. Zhai, C. Fue- gen, O. Kalinli, and M. L. Seltzer, “Evaluating user perception of speech recognition system quality with semantic distance metric,” arXiv preprint arXiv:2110.05376, 2021

work page arXiv 2021
[16]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11

2019
[17]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

[Online]. Available: http://arxiv.org/abs/1908.10084

work page internal anchor Pith review arXiv 1908
[18]

Corpus description of the ester evaluation cam- paign for the rich transcription of french broadcast news

S. Galliano, E. Geoffrois, G. Gravier, J.-F. Bonastre, D. Mostefa, and K. Choukri, “Corpus description of the ester evaluation cam- paign for the rich transcription of french broadcast news.” inIn- ternational Conference on Language Resources and Evaluation (LREC), 2006, pp. 139–142

2006
[19]

The ester 2 evaluation campaign for the rich transcription of french radio broadcasts,

S. Galliano, G. Gravier, and L. Chaubard, “The ester 2 evaluation campaign for the rich transcription of french radio broadcasts,” in Tenth Annual Conference of the International Speech Communi- cation Association, 2009

2009
[20]

The epac corpus: manual and automatic annotations of conver- sational speech in french broadcast news,

Y . Esteve, T. Bazillon, J.-Y . Antoine, F. B ´echet, and J. Farinas, “The epac corpus: manual and automatic annotations of conver- sational speech in french broadcast news,” inInternational Con- ference on Language Resources and Evaluation (LREC), 2010

2010
[21]

The etape corpus for the evaluation of speech-based tv content processing in the french language,

G. Gravier, G. Adda, N. Paulsson, M. Carr ´e, A. Giraudel, and O. Galibert, “The etape corpus for the evaluation of speech-based tv content processing in the french language,” inInternational Conference on Language Resources and Evaluation (LREC), 2012, pp. 114–118

2012
[22]

The repere corpus: a multimodal corpus for person recognition,

A. Giraudel, M. Carr ´e, V . Mapelli, J. Kahn, O. Galibert, and L. Quintard, “The repere corpus: a multimodal corpus for person recognition,” inInternational Conference on Language Resources and Evaluation (LREC), 2012, pp. 1102–1107

2012
[23]

The kaldi speech recognition toolkit,

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y . Qian, P. Schwarzet al., “The kaldi speech recognition toolkit,” inIEEE workshop on Au- tomatic Speech Recognition and Understanding (ASRU). IEEE Signal Processing Society, 2011

2011
[24]

Semi-orthogonal low-rank matrix factoriza- tion for deep neural networks

D. Povey, G. Cheng, Y . Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factoriza- tion for deep neural networks.” inInterspeech, 2018, pp. 3743– 3747

2018
[25]

Srilm-an extensible language modeling toolkit,

A. Stolcke, “Srilm-an extensible language modeling toolkit,” in Seventh international conference on spoken language processing, 2002

2002
[26]

Antilles: An open french linguistically enriched part-of-speech corpus,

Y . Labrak and R. Dufour, “Antilles: An open french linguistically enriched part-of-speech corpus,” inInternational Conference on Text, Speech, and Dialogue. Springer, 2022, pp. 28–38

2022
[27]

Contextual string embed- dings for sequence labeling,

A. Akbik, D. Blythe, and R. V ollgraf, “Contextual string embed- dings for sequence labeling,” inProceedings of the 27th interna- tional conference on computational linguistics, 2018, pp. 1638– 1649

2018
[28]

Enrich- ing word vectors with subword information,

P. Bojanowski, ´E. Grave, A. Joulin, and T. Mikolov, “Enrich- ing word vectors with subword information,”Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017

2017
[29]

Is word error rate a good indicator for spoken language understanding accuracy,

Y .-Y . Wang, A. Acero, and C. Chelba, “Is word error rate a good indicator for spoken language understanding accuracy,” inIEEE workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2003, pp. 577–582

2003
[30]

How to (prop- erly) evaluate cross-lingual word embeddings: On strong base- lines, comparative analyses, and some misconceptions,

G. Glava ˇs, R. Litschko, S. Ruder, and I. Vuli ´c, “How to (prop- erly) evaluate cross-lingual word embeddings: On strong base- lines, comparative analyses, and some misconceptions,” inPro- ceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, 2019, pp. 710–721

2019