A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition applied on French Language

Jane Wottawa; Mickael Rouvier; Richard Dufour; Thibault Bañeras-Roux

arxiv: 2605.03696 · v1 · submitted 2026-05-05 · 💻 cs.CL


Pith reviewed 2026-05-07 16:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic speech recognition, tokenization, self-supervised learning, French language, evaluation metrics, subword tokenization, end-to-end ASR, qualitative analysis

The pith

Subword tokenization and self-supervised learning choices affect French end-to-end ASR in ways missed by standard error rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a qualitative study on French that examines how different subword tokenization algorithms and self-supervised learning models influence end-to-end automatic speech recognition systems. It applies a broad collection of linguistic and acoustic evaluation metrics instead of depending only on character and word error rates. Standard metrics have been shown to give an incomplete view of transcript quality for actual applications, so this multi-perspective approach aims to expose differences that matter for real use. The work focuses on French to highlight tokenization and pretraining effects from both linguistic structure and acoustic fidelity standpoints.

Core claim

The authors conduct a qualitative study on the French language that investigates the impact of subword tokenization algorithms and self-supervised learning models from different linguistic and acoustic perspectives, using a comprehensive set of evaluation metrics beyond character and word error rates.
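
To make "subword tokenization algorithm" concrete, the following is a minimal sketch of byte-pair encoding (BPE), one widely used algorithm of this family. This page does not state which tokenizers or vocabulary sizes the paper compares, so the toy corpus, the function names (`learn_bpe`, `merge_pair`, `segment`) and the `</w>` end-of-word marker are assumptions made for illustration, not the authors' setup.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the most frequent pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus of French verb forms (illustration only).
corpus = ["parler", "parlons", "parlez", "mangeons", "mangez", "parlons"]
merges = learn_bpe(corpus, num_merges=8)
print(segment("parlons", merges))
print(segment("mangeons", merges))
```

The number of merges controls the granularity of the units: with few merges the output stays close to characters, with many it approaches whole words. That granularity is the kind of choice whose effect on French inflection a multi-metric evaluation could expose.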

What carries the argument

A comprehensive set of linguistic and acoustic evaluation metrics, used to compare subword tokenizers and SSL models in French ASR.

If this is right

Tokenization algorithms produce transcripts with measurably different linguistic properties in French ASR outputs.
Self-supervised models lead to varying acoustic characteristics in the resulting speech-to-text transcripts.
Relying solely on CER and WER overlooks aspects of ASR quality relevant to downstream tasks.
Model and hyperparameter selection for French ASR can be guided by multi-metric analysis rather than error rates alone.
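
The point about CER and WER can be illustrated with a short sketch. The code below computes both rates from a plain Levenshtein distance and applies them to two hypothetical French transcripts; the example sentences and function names are assumptions for illustration and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "il ne veut pas venir"
hyp_spelling = "il ne veux pas venir"  # homophone misspelling, meaning preserved
hyp_meaning = "il ne peut pas venir"   # "cannot" instead of "does not want to"

for hyp in (hyp_spelling, hyp_meaning):
    print(hyp, wer(reference, hyp), cer(reference, hyp))
# Both hypotheses score WER 0.2 and CER 0.05.
```

Both transcripts receive identical error rates even though only the second changes what the speaker said, which is the kind of gap an expanded metric suite is meant to surface.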

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers working on French ASR might test multiple tokenizers specifically for semantic preservation in addition to accuracy.
The approach could be extended to other morphologically rich languages to check whether similar metric gaps exist.
ASR pipelines may eventually adopt routine multi-perspective evaluation as a standard step before deployment.

Load-bearing premise

The chosen linguistic and acoustic metrics sufficiently capture downstream application quality, and the observed differences are attributable to tokenization and SSL choices rather than to other uncontrolled factors.

What would settle it

If swapping tokenization algorithms or SSL models produced no measurable differences across the full suite of linguistic and acoustic metrics, the claim that these choices have distinct effects beyond standard error rates would be undermined.

Figures

Figures reproduced from arXiv: 2605.03696 by Jane Wottawa, Mickael Rouvier, Richard Dufour, Thibault Bañeras-Roux.

**Figure 1.** Architecture of the Automatic Speech Recognition

Original abstract

The performance of end-to-end automatic speech recognition (ASR) systems enables their increasing integration into numerous applications. While there are various benefits to such speech-to-text systems, the choice of hyperparameters and models plays a crucial role in their performance. Typically, these choices are determined by considering only the character (CER) and/or word error rate (WER) metrics. However, it has been shown in several studies that these metrics are largely incomplete and fail to adequately describe the downstream application of automatic transcripts. In this paper, we conduct a qualitative study on the French language that investigates the impact of subword tokenization algorithms and self-supervised learning models from different linguistic and acoustic perspectives, using a comprehensive set of evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a solid but incremental empirical comparison of existing tokenizers and SSL models on French ASR using extra linguistic and acoustic metrics beyond WER/CER.

The core point is that standard error rates miss important aspects of ASR output quality, so the authors run a side-by-side test of common subword tokenizers and self-supervised pre-training models on French data and track a wider set of measures. That is the main thing a reader should take away: they treat French as the target language and show how choices in tokenization and pre-training shift outcomes on morphology, acoustics, and other dimensions that matter for downstream use.

The work is straightforward and stays within the bounds of what the experiments can support. No new algorithm or derivation appears, which keeps expectations realistic. The expanded metric suite is the useful addition here, since prior studies have already noted that WER alone is incomplete. They apply this to an under-studied language, which adds some practical value for anyone building French systems.

The soft spots sit in the experimental details rather than the framing. The abstract and high-level description do not spell out data splits, hyperparameter controls, or statistical testing on the metric differences, so it remains unclear how cleanly the observed gaps trace back to tokenization and SSL choices versus other setup factors. If those controls are tight in the full paper, the findings hold; if not, the differences could be noisier than presented. Reproducibility will depend on whether code and exact configurations are released.

This paper is for ASR practitioners who need guidance on tokenizer selection for French or similar languages and for researchers who want to see evaluation practices expanded. It does not move the theoretical frontier, but it supplies concrete comparisons that could inform model choices. I would send it to peer review. The empirical grounding is honest and the topic is relevant enough that referees can usefully check the controls and suggest tighter analysis.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a comparative empirical study on end-to-end ASR for French, examining the effects of multiple subword tokenization algorithms and self-supervised learning models. It argues that CER and WER alone are incomplete for assessing transcript quality and instead evaluates performance from linguistic and acoustic perspectives using an expanded set of metrics.

Significance. If the observed differences in the additional metrics are robust and attributable to the tokenization/SSL choices, the work could usefully inform model selection for French ASR in downstream tasks where linguistic structure or acoustic fidelity matters beyond raw error rates. The multi-perspective evaluation approach is a positive contribution to the field.

major comments (2)

[Experiments] No statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) are reported for differences across the linguistic and acoustic metrics. This is load-bearing because the central claim is that the study reveals impacts of tokenization and SSL choices; without tests it is impossible to distinguish signal from noise.
[Methodology and Results] The manuscript does not describe controls for confounding variables such as model parameter count, training data volume, or optimizer settings across the compared systems. This undermines the attribution of metric differences to tokenization/SSL rather than other experimental factors, which is the weakest assumption in the study design.

minor comments (2)

[Introduction] The abstract and introduction repeat the claim that CER/WER are 'largely incomplete' without citing the specific prior studies that demonstrated this for French or similar languages.
Figure captions and axis labels in the results plots are sometimes too small or lack units, reducing clarity when comparing metric values across conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and indicate the changes we will implement in the revised version.

Referee: Experiments section: No statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) are reported for differences across the linguistic and acoustic metrics. This is load-bearing because the central claim is that the study reveals impacts of tokenization and SSL choices; without tests it is impossible to distinguish signal from noise.

Authors: We agree that statistical significance testing strengthens the interpretation of metric differences. In the revised manuscript, we will add bootstrap confidence intervals (with 1000 resamples) for all reported linguistic and acoustic metric differences to quantify uncertainty and distinguish robust effects from noise.

Revision: yes

Referee: Methodology and Results: The manuscript does not describe controls for confounding variables such as model parameter count, training data volume, or optimizer settings across the compared systems. This undermines the attribution of metric differences to tokenization/SSL rather than other experimental factors, which is the weakest assumption in the study design.

Authors: We acknowledge the limitation in isolating variables, as our study evaluates practical combinations of tokenizers with off-the-shelf SSL models (e.g., wav2vec2, HuBERT) using their standard pretrained configurations. In revision, we will add a dedicated table in the methodology section detailing parameter counts, pretraining data sizes, and fine-tuning settings for each system, along with a discussion of potential confounders and how the results reflect real-world model selection rather than fully controlled ablations.

Revision: partial
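
For readers unfamiliar with the bootstrap confidence intervals discussed in the first exchange, here is a minimal sketch of a paired percentile bootstrap over per-utterance scores. The rebuttal only specifies 1000 resamples; the percentile variant, the pairing by utterance, the use of a mean of per-utterance scores (rather than a corpus-level rate), the function name and the toy numbers are all assumptions for illustration.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-utterance difference A - B.

    scores_a[i] and scores_b[i] must be the metric values of systems A and B
    on the same utterance i.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping the A/B pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-utterance scores for two systems (illustration only).
system_a = [0.12, 0.30, 0.08, 0.22, 0.15, 0.40, 0.05, 0.18]
system_b = [0.10, 0.33, 0.09, 0.20, 0.11, 0.38, 0.07, 0.16]
low, high = paired_bootstrap_ci(system_a, system_b)
print(f"95% CI for mean difference: [{low:.3f}, {high:.3f}]")
```

If the resulting interval contains zero, the difference between the two systems on that metric cannot be separated from sampling noise at the chosen level, which is the distinction the referee asks for.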

Circularity Check

0 steps flagged

Empirical study with no derivations or predictions

The paper presents an empirical qualitative study on the effects of subword tokenization and self-supervised learning models for French end-to-end ASR, evaluated via multiple linguistic and acoustic metrics. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-citation load-bearing premises are present. The central claim is simply that the authors performed and reported the comparison; this is self-contained experimental description without any reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the domain assumption that an expanded set of linguistic and acoustic metrics provides a meaningfully more complete picture of ASR quality than CER/WER alone; no free parameters, new axioms, or invented entities are introduced.
