HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Jane Wottawa; Mickael Rouvier; Richard Dufour; Teva Merlin; Thibault Bañeras Roux

arxiv: 2604.27542 · v2 · submitted 2026-04-30 · 💻 cs.CL

Pith reviewed 2026-05-07 09:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic speech recognition · evaluation metrics · human perception · HATS dataset · word error rate · embedding metrics · transcription quality · French speech

The pith

New open dataset records human preferences on ASR transcript pairs to test metric alignment with listener perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HATS, an open French dataset built from side-by-side choices made by 143 humans between pairs of automatic speech recognition transcripts. It then measures how closely standard lexical metrics such as word error rate and newer embedding-based metrics match those human decisions. This approach treats human perception as the reference point rather than assuming word-level accuracy alone defines quality. If the correlations are weak, current metrics may steer ASR development toward outputs that people do not actually prefer.

Core claim

We introduce the Human Assessed Transcription Side-by-side (HATS) dataset, consisting of preference labels from 143 annotators who chose the better of two ASR hypotheses for French speech. Using this data, we examine the relationship between human judgments and both lexical metrics like word error rate and embedding-based semantic metrics.

What carries the argument

The HATS collection of human side-by-side preference labels on paired ASR transcriptions, used as ground truth to assess how well automatic metrics reflect perceived quality.
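A minimal sketch of how such preference labels could be used to score a metric follows. It assumes each pair comes with a reference transcript so that an error metric can be computed, which this page does not confirm. The names `metric_human_agreement`, `toy_error`, and `toy_pairs` are hypothetical, and the sentences are invented for illustration rather than taken from HATS.

```python
from typing import Callable, Sequence, Tuple

# (reference, hypothesis_a, hypothesis_b, human_choice), human_choice in {"a", "b"}
Pair = Tuple[str, str, str, str]


def metric_human_agreement(
    pairs: Sequence[Pair],
    error_metric: Callable[[str, str], float],
) -> float:
    """Fraction of decidable pairs where the metric and the human pick the same hypothesis.

    `error_metric(reference, hypothesis)` must return a lower-is-better score.
    Pairs on which the metric scores both hypotheses equally are skipped.
    """
    agree = 0
    decided = 0
    for reference, hyp_a, hyp_b, human_choice in pairs:
        score_a = error_metric(reference, hyp_a)
        score_b = error_metric(reference, hyp_b)
        if score_a == score_b:
            continue  # the metric expresses no preference on this pair
        metric_choice = "a" if score_a < score_b else "b"
        decided += 1
        agree += int(metric_choice == human_choice)
    if decided == 0:
        raise ValueError("the metric never preferred one hypothesis over the other")
    return agree / decided


def toy_error(reference: str, hypothesis: str) -> float:
    """Stand-in metric (NOT WER): number of words present in only one of the two texts."""
    return float(len(set(reference.split()) ^ set(hypothesis.split())))


# Invented examples; the human choices are made up for the sketch.
toy_pairs = [
    ("il fait beau", "il fait beau", "il fait bon", "a"),
    ("je vais bien", "je vais", "je vais bien merci", "b"),
    ("bonjour à tous", "bonjour tous", "bon jour à tous", "b"),
]

print(metric_human_agreement(toy_pairs, toy_error))
```

Any lower-is-better metric can be passed in place of `toy_error`, so lexical and embedding-based metrics could be compared on the same pairs. Skipping ties is one design choice among several; counting them as half-agreements would be another.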

If this is right

Metrics showing strong correlation with HATS preferences can replace or supplement word error rate when selecting or training ASR models for human use.
The open dataset allows direct benchmarking of new evaluation methods against actual listener choices.
ASR systems optimized with human-aligned metrics may produce transcripts that users find more acceptable in practice.
Embedding-based metrics can be prioritized if they prove closer to HATS judgments than purely lexical ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Low correlations would indicate that error severity and fluency matter to humans in ways current metrics miss.
The side-by-side preference format could extend to other generative tasks such as machine translation where human judgment is the real target.
Releasing the dataset supports training of learned preference models that predict human choices without new annotations each time.

Load-bearing premise

Isolated text-only side-by-side choices by 143 humans on transcription pairs accurately stand in for human perception of ASR quality when people hear the original audio in realistic conditions.

What would settle it

A follow-up test in which participants listen to the source audio while choosing preferred transcripts, then check whether their selections match the text-only labels in HATS.
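One way to quantify the match in such a follow-up test is chance-corrected agreement between the two label sets. The sketch below computes Cohen's kappa in plain Python. The label lists are invented, and the assumption that each pair receives one aggregated label per condition is ours, not the paper's.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_x: Sequence[str], labels_y: Sequence[str]) -> float:
    """Cohen's kappa between two labelings of the same items."""
    if len(labels_x) != len(labels_y) or len(labels_x) == 0:
        raise ValueError("both label sequences must be non-empty and of equal length")
    n = len(labels_x)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_x, labels_y)) / n
    # Expected agreement if the two labelings were independent.
    freq_x = Counter(labels_x)
    freq_y = Counter(labels_y)
    categories = set(freq_x) | set(freq_y)
    expected = sum(freq_x[c] * freq_y[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both labelings use one identical category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical aggregated choices per transcript pair under each condition.
text_only = ["a", "a", "b", "b", "a", "b", "a", "b"]
with_audio = ["a", "a", "b", "a", "a", "b", "b", "b"]

print(cohen_kappa(text_only, with_audio))
```

A kappa near 1 would suggest the text-only labels are a reasonable proxy for listening-based choices. A value near 0 would indicate agreement no better than chance.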

Figures

Figures reproduced from arXiv: 2604.27542 by Jane Wottawa, Mickael Rouvier, Richard Dufour, Teva Merlin, Thibault Bañeras Roux.

**Figure 1.** Screenshot from the side-by-side experiment.

**Figure 2.** Participant characterization in terms of number of spoken languages.

**Figure 3.** Participant characterization in terms of level of education.

Excerpt from the text accompanying Figure 3 (truncated in the source): … decided to study human behavior and metrics in complex situation, i.e. where humans have difficulties to choose the best transcription. In this context, the aim was to maximize the diversity of choices to be made: subjects had to choose among errors made by different systems (since it is unlikely that different systems produce identical errors). …

Original abstract

Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER) metric is the reference for evaluating speech transcripts. Several studies have shown that this measure is too limited to correctly evaluate an ASR system, which has led to the proposal of other variants of metrics (weighted WER, BERTscore, semantic distance, etc.). However, they remain system-oriented, even when transcripts are intended for humans. In this paper, we firstly present Human Assessed Transcription Side-by-side (HATS), an original French manually annotated data set in terms of human perception of transcription errors produced by various ASR systems. 143 humans were asked to choose the best automatic transcription out of two hypotheses. We investigated the relationship between human preferences and various ASR evaluation metrics, including lexical and embedding-based ones, the latter being those that correlate supposedly the most with human perception.
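For readers unfamiliar with the metric, word error rate can be sketched as the word-level edit distance between a hypothesis and its reference, divided by the reference length. The implementation below is a plain-Python illustration, not the authors' evaluation code. The French sentences are invented to show two hypotheses that receive the same WER although one changes the meaning far more than the other.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example, not taken from HATS.
reference = "le chat dort sur le canapé"
hyp_minor = "le chat dort sur les canapé"  # agreement slip: "le" -> "les"
hyp_major = "le chat mort sur le canapé"   # meaning change: "dort" -> "mort"

print(word_error_rate(reference, hyp_minor))
print(word_error_rate(reference, hyp_major))
```

Both hypotheses contain a single substitution over six reference words, so WER cannot separate them. This is the kind of case where a human preference label carries information the lexical metric lacks.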

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HATS adds a French human preference dataset for ASR transcripts but shows no results yet and uses text-only pairwise choices that may not capture real listening-based perception.

The paper's main contribution is the HATS dataset: 143 French speakers made side-by-side choices between pairs of ASR hypotheses. They plan to measure how well standard metrics (WER and variants, BERTScore, semantic distances) align with those choices. That dataset is new and fills a gap for non-English work on metric validation. The effort to gather human judgments is the part that stands out as useful, since everyone knows WER misses semantic and fluency issues that matter to listeners. The authors are straightforward about the goal and do not overclaim results in the abstract.

The soft spots are straightforward. No correlation numbers, confidence intervals, or inter-annotator stats appear yet, so the central claim about which metrics track human perception remains untested in the text we have. More importantly, the protocol presents only the two text hypotheses; without the source audio or reference, annotators are effectively judging written fluency or grammaticality rather than how the errors sound in speech. That mismatch could make any later correlations less informative for actual ASR use cases. Minor details like exact exclusion rules and how the 143 participants were recruited also need to be spelled out before the data can be trusted for follow-on work.

This is the kind of paper that matters to the small group of people building or benchmarking ASR metrics, especially those working on French or other under-resourced languages. A reader who needs a fresh human benchmark to test new embedding metrics would get value from the released data once the numbers are in. It is worth sending to peer review because new annotated preference sets in this area are rare and the basic idea is sound, even though the current version needs the missing results and a clearer defense of the annotation interface.
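The confidence intervals mentioned in the letter could be obtained in several ways. A percentile bootstrap over transcript pairs is one simple option, sketched here. The 0/1 outcomes (1 when a metric agrees with the human choice on a pair) are invented, and `bootstrap_ci` is a hypothetical helper rather than anything described in the paper.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for the mean of 0/1 outcomes.

    Uses a percentile bootstrap: resample the pairs with replacement and
    read the interval bounds off the sorted resampled means.
    """
    if len(outcomes) == 0:
        raise ValueError("outcomes must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means: List[float] = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[round((alpha / 2) * (n_resamples - 1))]
    upper = means[round((1 - alpha / 2) * (n_resamples - 1))]
    return sum(outcomes) / n, lower, upper


# Invented data: the metric agrees with the human on 28 of 40 pairs.
toy_outcomes = [1] * 28 + [0] * 12
point, lower, upper = bootstrap_ci(toy_outcomes)
print(f"agreement = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Resampling at the pair level treats pairs as independent. If several pairs come from the same utterance or annotator, resampling at that coarser level would be the safer assumption.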

Referee Report

1 major / 2 minor

Summary. The paper introduces the HATS dataset of side-by-side preferences collected from 143 human annotators over pairs of French ASR transcriptions, intended to capture human perception of transcription errors. It then reports correlations between these preferences and a range of ASR metrics, including lexical measures such as WER and embedding-based metrics that are hypothesized to align better with human judgments.

Significance. If the collected preferences validly reflect human assessment of ASR output quality under realistic conditions, the open dataset would be a useful resource for validating or improving automatic evaluation metrics beyond WER. The work directly addresses the known limitations of system-oriented metrics for human-facing applications.

major comments (1)

[Methods / Data Collection] Data collection protocol (described in the methods section): annotators are presented only with isolated pairs of text hypotheses and asked to choose the better transcription. This setup does not include the source audio, reference transcript, or any use-case context, so the resulting labels may capture relative textual fluency or grammaticality rather than perception of ASR-specific errors. Because this protocol is load-bearing for the central claim that HATS integrates 'human perception of transcription errors,' the subsequent metric-correlation results cannot be interpreted as evidence about which metrics best track real-world ASR quality.

minor comments (2)

[Abstract] The abstract states that embedding-based metrics 'correlate supposedly the most with human perception' without citing the specific prior studies or providing the exact correlation values obtained in this work.
[Results] No exclusion criteria, inter-annotator agreement statistics, or error bars on the reported correlations are mentioned in the abstract; these details should be added to the results section for reproducibility.
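As an illustration of the inter-annotator agreement statistic requested above, Fleiss' kappa is a common choice when several annotators label each item. The sketch assumes every pair was judged by the same number of annotators, which this page does not state for HATS. The counts are invented.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa.

    counts[i][j] is the number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators (at least 2).
    """
    if len(counts) == 0:
        raise ValueError("counts must contain at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of annotators")
    n_categories = len(counts[0])
    # Per-item agreement: share of annotator pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1.0 - expected)


# Invented example: 4 transcript pairs, 5 annotators each,
# columns = votes for hypothesis A, votes for hypothesis B.
toy_counts = [[5, 0], [4, 1], [1, 4], [0, 5]]
print(fleiss_kappa(toy_counts))
```

If the number of annotators varies per pair, Krippendorff's alpha would be a more suitable statistic than this fixed-panel form.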

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern regarding the data collection protocol is well-taken, as it directly impacts how the human judgments should be interpreted. We address this point below and outline planned revisions to improve clarity and avoid overstatement.

Referee: [Methods / Data Collection] Data collection protocol (described in the methods section): annotators are presented only with isolated pairs of text hypotheses and asked to choose the better transcription. This setup does not include the source audio, reference transcript, or any use-case context, so the resulting labels may capture relative textual fluency or grammaticality rather than perception of ASR-specific errors. Because this protocol is load-bearing for the central claim that HATS integrates 'human perception of transcription errors,' the subsequent metric-correlation results cannot be interpreted as evidence about which metrics best track real-world ASR quality.

Authors: We agree that the protocol, which presents only text pairs without audio or reference, means the collected preferences primarily reflect relative textual quality (e.g., fluency, grammaticality, or overall readability) rather than direct human perception of ASR errors grounded in the audio signal. This design was intentional to simulate common downstream scenarios where users evaluate or compare ASR outputs as text alone (such as in subtitling or document review), but we acknowledge it limits claims about ASR-specific error perception. We will revise the manuscript as follows: (1) update the abstract, introduction, and title phrasing to emphasize 'human preferences on ASR transcripts' instead of 'human perception of transcription errors'; (2) expand the methods section to explicitly describe the text-only setup and its rationale; and (3) add a dedicated limitations subsection discussing how this affects interpretation of the metric correlations. These changes will ensure the dataset's scope is accurately represented without overstating its alignment with audio-based ASR quality assessment.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical data collection and correlation analysis

The paper collects a new dataset of side-by-side preferences from 143 human annotators on ASR transcript pairs and reports correlations against lexical and embedding-based metrics. No equations, fitted parameters, predictions, or self-citations are used to derive results; the central contribution is the open dataset itself plus straightforward empirical comparison. All steps remain independent of any internal redefinition or self-referential justification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced. The work rests on the domain assumption that aggregated human preferences constitute a valid reference for metric quality.

axioms (1)

domain assumption: Human side-by-side preferences on transcription pairs reflect meaningful perception of ASR quality.
This assumption underpins the use of the collected judgments to evaluate automatic metrics.
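Treating aggregated preferences as a reference requires an aggregation rule. A majority vote per transcript pair is one plausible choice, sketched below; the paper's actual aggregation procedure is not described on this page. The pair identifiers and votes are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def aggregate_votes(votes: Iterable[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map each pair id to its majority choice, or None when the top choices tie."""
    tallies: Dict[str, Counter] = {}
    for pair_id, choice in votes:
        tallies.setdefault(pair_id, Counter())[choice] += 1
    majority: Dict[str, Optional[str]] = {}
    for pair_id, tally in tallies.items():
        ranked = tally.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            majority[pair_id] = None  # no clear human preference
        else:
            majority[pair_id] = ranked[0][0]
    return majority


# Invented votes: (pair id, chosen hypothesis).
toy_votes = [
    ("p1", "a"), ("p1", "a"), ("p1", "b"),
    ("p2", "b"), ("p2", "a"),
    ("p3", "b"),
]
print(aggregate_votes(toy_votes))
```

Returning `None` on ties keeps ambiguous pairs visible, so a downstream analysis can either drop them or study them separately as cases where humans themselves disagree.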

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1]

Advances in Neural Information Processing Systems33, 12449–12460 (2020)

Baevski, A., Zhou, Y ., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems33, 12449–12460 (2020)

work page 2020
[2]

In: Interspeech 2022 (2022)

Bañeras-Roux, T., Rouvier, M., Wottawa, J., Dufour, R.: Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition. In: Interspeech 2022 (2022)

work page 2022
[3]

Transactions of the association for computational linguistics5, 135–146 (2017)

Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the association for computational linguistics5, 135–146 (2017)

work page 2017
[4]

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V olume 1 (Long and Short Papers). pp. 4171–4186 (2019)

work page 2019
[5]

In: International Conference on Language Resources and Evaluation (LREC) (2010)

Esteve, Y ., Bazillon, T., Antoine, J.Y ., Béchet, F., Farinas, J.: The EPAC corpus: manual and automatic annotations of conversational speech in French broadcast news. In: International Conference on Language Resources and Evaluation (LREC) (2010)

work page 2010
[6]

In: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) (2021) 10 Thibault Bañeras Roux et al

Evain, S., Nguyen, M.H., Le, H., Boito, M.Z., Mdhaffar, S., Alisamir, S., Tong, Z., Tomashenko, N., Dinarelli, M., Parcollet, T., et al.: Task agnostic and task specific self- supervised learning from speech with lebenchmark. In: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) (2021) 10 Thibault Bañeras Roux et al

work page 2021
[7]

Favre, B., Cheung, K., Kazemian, S., Lee, A., Liu, Y ., Munteanu, C., Nenkova, A., Ochei, D., Penn, G., Tratz, S., et al.: Automatic human utility evaluation of ASR systems: Does WER really predict performance? In: INTERSPEECH. pp. 3463–3467 (2013)

work page 2013
[8]

In: Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi

Freitag, M., Rei, R., Mathur, N., kiu Lo, C., Stewart, C., Avramidis, E., Kocmi, T., Foster, G., Lavie, A., Martins, A.F.: Results of WMT22 Metrics Shared Task: Stop Using BLEU–Neural Metrics Are Better and More Robust. In: Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics (2022)

work page 2022
[9]

In: Proceedings of the Sixth Conference on Machine Translation

Freitag, M., Rei, R., Mathur, N., Lo, C.k., Stewart, C., Foster, G., Lavie, A., Bojar, O.: Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In: Proceedings of the Sixth Conference on Machine Translation. pp. 733–774 (2021)

work page 2021
[10]

In: International Conference on Language Resources and Evaluation (LREC)

Galliano, S., Geoffrois, E., Gravier, G., Bonastre, J.F., Mostefa, D., Choukri, K.: Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broad- cast News. In: International Conference on Language Resources and Evaluation (LREC). pp. 139–142 (2006)

work page 2006
[11]

In: Tenth Annual Conference of the International Speech Communication Association (2009)

Galliano, S., Gravier, G., Chaubard, L.: The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In: Tenth Annual Conference of the International Speech Communication Association (2009)

work page 2009
[12]

In: International Conference on Language Resources and Evaluation (LREC)

Giraudel, A., Carré, M., Mapelli, V ., Kahn, J., Galibert, O., Quintard, L.: The repere corpus: a multimodal corpus for person recognition. In: International Conference on Language Resources and Evaluation (LREC). pp. 1102–1107 (2012)

work page 2012
[13]

In: Proceedings of the 27th ACM SIGKDD Conference on Knowl- edge Discovery & Data Mining

Gordeeva, L., Ershov, V ., Gulyaev, O., Kuralenok, I.: Meaning Error Rate: ASR domain- specific metric framework. In: Proceedings of the 27th ACM SIGKDD Conference on Knowl- edge Discovery & Data Mining. pp. 458–466 (2021)

work page 2021
[14]

In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)

Grave, É., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)

work page 2018
[15]

In: International Conference on Language Resources and Evaluation (LREC)

Gravier, G., Adda, G., Paulsson, N., Carré, M., Giraudel, A., Galibert, O.: The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In: International Conference on Language Resources and Evaluation (LREC). pp. 114–118 (2012)

work page 2012
[16]

IEEE/ACM Transactions on Audio, Speech, and Language Processing29, 3451–3460 (2021)

Hsu, W.N., Bolte, B., Tsai, Y .H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: Hu- bert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing29, 3451–3460 (2021)

work page 2021
[17]

In: Sixteenth Annual Conference of the International Speech Communication Association (2015)

Itoh, N., Kurata, G., Tachibana, R., Nishimura, M.: A metric for evaluating speech recog- nizer output based on human-perception model. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)

work page 2015
[18]

Technometrics 33(3), 251–272 (1991)

Juang, B.H., Rabiner, L.R.: Hidden Markov models for speech recognition. Technometrics 33(3), 251–272 (1991)

work page 1991
[19]

In: Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility

Kafle, S., Huenerfauth, M.: Evaluating the usability of automatically generated captions for people who are deaf or hard of hearing. In: Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility. pp. 165–174 (2017)

work page 2017
[20]

Singing voice graph modeling for singfake detection

Kim, S., Arora, A., Le, D., Yeh, C.F., Fuegen, C., Kalinli, O., Seltzer, M.L.: Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding. In: Proc. Interspeech 2021. pp. 1977–1981 (2021). https://doi.org/10.21437/Interspeech. 2021-1929

work page doi:10.21437/interspeech 2021
[21]

In: Proc

Kim, S., Le, D., Zheng, W., Singh, T., Arora, A., Zhai, X., Fuegen, C., Kalinli, O., Seltzer, M.: Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric. In: Proc. Interspeech 2022. pp. 3978–3982 (2022). https://doi.org/10.21437/ Interspeech.2022-11144 Human Perception Applied to the Evaluation of ASR Metrics 11

work page 2022
[22]

In: Proceedings of the 12th Language Resources and Evaluation Conference

Le, H., Vial, L., Frej, J., Segonne, V ., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 2479–2490 (2020)

work page 2020
[23]

In: Proc

Le, N.T., Servan, C., Lecouteux, B., Besacier, L.: Better Evaluation of ASR in Speech Translation Context Using Word Embeddings. In: Proc. Interspeech 2016. pp. 2538–2542 (2016).https://doi.org/10.21437/Interspeech.2016-464

work page doi:10.21437/interspeech.2016-464 2016
[24]

In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Martin, L., Muller, B., Suárez, P.J.O., Dupont, Y ., Romary, L., De La Clergerie, É.V ., Seddah, D., Sagot, B.: CamemBERT: a Tasty French Language Model. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 7203–7219 (2020)

work page 2020
[25]

In: Proceedings of the Fifth Conference on Machine Translation

Mathur, N., Wei, J., Freitag, M., Ma, Q., Bojar, O.: Results of the WMT20 metrics shared task. In: Proceedings of the Fifth Conference on Machine Translation. pp. 688–725 (2020)

work page 2020
[26]

In: INTER- SPEECH

Mdhaffar, S., Estève, Y ., Hernandez, N., Laurent, A., Dufour, R., Quiniou, S.: Qualitative evaluation of asr adaptation in a lecture context: Application to the pastel corpus. In: INTER- SPEECH. pp. 569–573 (2019)

work page 2019
[27]

International Journal of Semantic Computing13(01), 45–65 (2019)

Nam, S., Fels, D.: Simulation of Subjective Closed Captioning Quality Assessment Using Prediction Models. International Journal of Semantic Computing13(01), 45–65 (2019)

work page 2019
[28]

In: Proceedings of the international conference on Multimedia information retrieval

Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing: a study about inter- annotator agreement for multi-label image annotation. In: Proceedings of the international conference on Multimedia information retrieval. pp. 557–566 (2010)

work page 2010
[29]

In: IEEE 2011 workshop on automatic speech recognition and understanding

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y ., Schwarz, P., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding. No. CONF, IEEE Signal Processing Society (2011)

work page 2011
[30]

Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.C., Yeh, S.L., Fu, S.W., Liao, C.F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y ., Mori, R.D., Bengio, Y .: SpeechBrain: A general- purpose speech toolkit (2021), arXiv:2106.04624

work page arXiv 2021
[31]

In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings using Siamese BERT- Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992 (2019)

work page 2019
[32]

In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12)

Vasilescu, I., Adda-Decker, M., Lamel, L.: Cross-lingual studies of ASR errors: paradigms for perceptual evaluations. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). pp. 3511–3518 (2012)

work page 2012
[33]

In: 2003 IEEE workshop on automatic speech recognition and understanding (IEEE Cat

Wang, Y .Y ., Acero, A., Chelba, C.: Is word error rate a good indicator for spoken language understanding accuracy. In: 2003 IEEE workshop on automatic speech recognition and understanding (IEEE Cat. No. 03EX721). pp. 577–582. IEEE (2003)

work page 2003
[34]

In: International Conference on Learning Representations (2020),https: //openreview.net/forum?id=SkeHuCVFDr

Zhang*, T., Kishore*, V ., Wu*, F., Weinberger, K.Q., Artzi, Y .: Bertscore: Evaluating text generation with bert. In: International Conference on Learning Representations (2020),https: //openreview.net/forum?id=SkeHuCVFDr

work page 2020

[1] [1]

Advances in Neural Information Processing Systems33, 12449–12460 (2020)

Baevski, A., Zhou, Y ., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems33, 12449–12460 (2020)

work page 2020

[2] [2]

In: Interspeech 2022 (2022)

Bañeras-Roux, T., Rouvier, M., Wottawa, J., Dufour, R.: Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition. In: Interspeech 2022 (2022)

work page 2022

[3] [3]

Transactions of the association for computational linguistics5, 135–146 (2017)

Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the association for computational linguistics5, 135–146 (2017)

work page 2017

[4] [4]

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V olume 1 (Long and Short Papers). pp. 4171–4186 (2019)

work page 2019

[5] [5]

In: International Conference on Language Resources and Evaluation (LREC) (2010)

Esteve, Y ., Bazillon, T., Antoine, J.Y ., Béchet, F., Farinas, J.: The EPAC corpus: manual and automatic annotations of conversational speech in French broadcast news. In: International Conference on Language Resources and Evaluation (LREC) (2010)

work page 2010

[6] [6]

In: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) (2021) 10 Thibault Bañeras Roux et al

Evain, S., Nguyen, M.H., Le, H., Boito, M.Z., Mdhaffar, S., Alisamir, S., Tong, Z., Tomashenko, N., Dinarelli, M., Parcollet, T., et al.: Task agnostic and task specific self- supervised learning from speech with lebenchmark. In: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) (2021) 10 Thibault Bañeras Roux et al

work page 2021

[7] [7]

Favre, B., Cheung, K., Kazemian, S., Lee, A., Liu, Y ., Munteanu, C., Nenkova, A., Ochei, D., Penn, G., Tratz, S., et al.: Automatic human utility evaluation of ASR systems: Does WER really predict performance? In: INTERSPEECH. pp. 3463–3467 (2013)

work page 2013

[8] [8]

In: Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi

Freitag, M., Rei, R., Mathur, N., kiu Lo, C., Stewart, C., Avramidis, E., Kocmi, T., Foster, G., Lavie, A., Martins, A.F.: Results of WMT22 Metrics Shared Task: Stop Using BLEU–Neural Metrics Are Better and More Robust. In: Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi. Association for Computational Linguistics (2022)

work page 2022

[9] [9]

In: Proceedings of the Sixth Conference on Machine Translation

Freitag, M., Rei, R., Mathur, N., Lo, C.k., Stewart, C., Foster, G., Lavie, A., Bojar, O.: Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In: Proceedings of the Sixth Conference on Machine Translation. pp. 733–774 (2021)

work page 2021

[10] [10]

In: International Conference on Language Resources and Evaluation (LREC)

Galliano, S., Geoffrois, E., Gravier, G., Bonastre, J.F., Mostefa, D., Choukri, K.: Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broad- cast News. In: International Conference on Language Resources and Evaluation (LREC). pp. 139–142 (2006)

work page 2006

[11] [11]

In: Tenth Annual Conference of the International Speech Communication Association (2009)

Galliano, S., Gravier, G., Chaubard, L.: The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In: Tenth Annual Conference of the International Speech Communication Association (2009)

work page 2009

[12] [12]

In: International Conference on Language Resources and Evaluation (LREC)

Giraudel, A., Carré, M., Mapelli, V ., Kahn, J., Galibert, O., Quintard, L.: The repere corpus: a multimodal corpus for person recognition. In: International Conference on Language Resources and Evaluation (LREC). pp. 1102–1107 (2012)

work page 2012

[13] [13]

In: Proceedings of the 27th ACM SIGKDD Conference on Knowl- edge Discovery & Data Mining

Gordeeva, L., Ershov, V ., Gulyaev, O., Kuralenok, I.: Meaning Error Rate: ASR domain- specific metric framework. In: Proceedings of the 27th ACM SIGKDD Conference on Knowl- edge Discovery & Data Mining. pp. 458–466 (2021)

work page 2021

[14] [14]

In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)

Grave, É., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)

work page 2018

[15] [15]

In: International Conference on Language Resources and Evaluation (LREC)

Gravier, G., Adda, G., Paulsson, N., Carré, M., Giraudel, A., Galibert, O.: The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In: International Conference on Language Resources and Evaluation (LREC). pp. 114–118 (2012)

work page 2012

[16] [16]

IEEE/ACM Transactions on Audio, Speech, and Language Processing29, 3451–3460 (2021)

Hsu, W.N., Bolte, B., Tsai, Y .H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: Hu- bert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing29, 3451–3460 (2021)

work page 2021

[17] [17]

In: Sixteenth Annual Conference of the International Speech Communication Association (2015)

Itoh, N., Kurata, G., Tachibana, R., Nishimura, M.: A metric for evaluating speech recog- nizer output based on human-perception model. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)

work page 2015

[18] [18]

Technometrics 33(3), 251–272 (1991)

Juang, B.H., Rabiner, L.R.: Hidden Markov models for speech recognition. Technometrics 33(3), 251–272 (1991)

work page 1991

[19] [19]

In: Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility

Kafle, S., Huenerfauth, M.: Evaluating the usability of automatically generated captions for people who are deaf or hard of hearing. In: Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility. pp. 165–174 (2017)

work page 2017

[20] [20]

Singing voice graph modeling for singfake detection

Kim, S., Arora, A., Le, D., Yeh, C.F., Fuegen, C., Kalinli, O., Seltzer, M.L.: Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding. In: Proc. Interspeech 2021. pp. 1977–1981 (2021). https://doi.org/10.21437/Interspeech. 2021-1929

work page doi:10.21437/interspeech 2021

[21] [21]

In: Proc

Kim, S., Le, D., Zheng, W., Singh, T., Arora, A., Zhai, X., Fuegen, C., Kalinli, O., Seltzer, M.: Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric. In: Proc. Interspeech 2022. pp. 3978–3982 (2022). https://doi.org/10.21437/ Interspeech.2022-11144 Human Perception Applied to the Evaluation of ASR Metrics 11

[22]

Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., Schwab, D.: FlauBERT: Unsupervised Language Model Pre-training for French. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 2479–2490 (2020)

[23]

Le, N.T., Servan, C., Lecouteux, B., Besacier, L.: Better Evaluation of ASR in Speech Translation Context Using Word Embeddings. In: Proc. Interspeech 2016. pp. 2538–2542 (2016). https://doi.org/10.21437/Interspeech.2016-464

[24]

Martin, L., Muller, B., Suárez, P.J.O., Dupont, Y., Romary, L., De La Clergerie, É.V., Seddah, D., Sagot, B.: CamemBERT: a Tasty French Language Model. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 7203–7219 (2020)

[25]

Mathur, N., Wei, J., Freitag, M., Ma, Q., Bojar, O.: Results of the WMT20 metrics shared task. In: Proceedings of the Fifth Conference on Machine Translation. pp. 688–725 (2020)

[26]

Mdhaffar, S., Estève, Y., Hernandez, N., Laurent, A., Dufour, R., Quiniou, S.: Qualitative evaluation of ASR adaptation in a lecture context: Application to the PASTEL corpus. In: INTERSPEECH. pp. 569–573 (2019)

[27]

Nam, S., Fels, D.: Simulation of Subjective Closed Captioning Quality Assessment Using Prediction Models. International Journal of Semantic Computing 13(01), 45–65 (2019)

[28]

Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the international conference on Multimedia information retrieval. pp. 557–566 (2010)

[29]

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding. No. CONF, IEEE Signal Processing Society (2011)

[30]

Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.C., Yeh, S.L., Fu, S.W., Liao, C.F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., Mori, R.D., Bengio, Y.: SpeechBrain: A general-purpose speech toolkit (2021), arXiv:2106.04624

[31]

Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992 (2019)

[32]

Vasilescu, I., Adda-Decker, M., Lamel, L.: Cross-lingual studies of ASR errors: paradigms for perceptual evaluations. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). pp. 3511–3518 (2012)

[33]

Wang, Y.Y., Acero, A., Chelba, C.: Is word error rate a good indicator for spoken language understanding accuracy. In: 2003 IEEE workshop on automatic speech recognition and understanding (IEEE Cat. No. 03EX721). pp. 577–582. IEEE (2003)

[34]

Zhang*, T., Kishore*, V., Wu*, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=SkeHuCVFDr
