Where Do Self-Supervised Speech Models Become Unfair?
Pith reviewed 2026-05-10 04:06 UTC · model grok-4.3
The pith
Self-supervised speech models embed biases against certain speaker groups from their first layers, with bias patterns that invert between speaker identification and speech recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-supervised speech encoder models produce embeddings biased against certain speaker groups for both speaker identification and automatic speech recognition tasks, starting at the very first latent layers. SID bias is minimized in layers that minimize overall SID error, while ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR remains when probing models that have been fine-tuned for ASR, indicating that speaker-group bias is established during pretraining and resists removal by later adaptation.
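The inversion described above can be read as a sign test on the rank correlation between layerwise error and layerwise bias: positive for SID, negative for ASR. The following is a minimal sketch under stated assumptions, not the paper's analysis. The per-layer numbers are invented for illustration, and `ranks` and `spearman` are hypothetical helper names.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks of values (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-layer values, made up for illustration (not from the paper).
sid_error = [0.40, 0.25, 0.15, 0.20, 0.35]
sid_bias = [0.12, 0.08, 0.03, 0.05, 0.10]
asr_error = [0.60, 0.45, 0.30, 0.22, 0.25]
asr_bias = [0.05, 0.09, 0.14, 0.20, 0.18]

sid_rho = spearman(sid_error, sid_bias)  # positive: bias falls where error falls
asr_rho = spearman(asr_error, asr_bias)  # negative: bias rises where error falls
```

With real probe results, a correlation near zero for either task would weaken the claimed inversion.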
What carries the argument
Layer-by-layer probing of embeddings from self-supervised speech models using separate classifiers for speaker identification and automatic speech recognition to track bias magnitude against speaker groups.
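As a rough illustration of layer-by-layer probing, the sketch below fits one small classifier per layer and records its test error. It assumes synthetic two-speaker embeddings and swaps in a nearest-centroid probe for brevity. The paper's own probes are lightweight trained decoders, and `centroid_probe` and `fake_layer_embeddings` are hypothetical names.

```python
import random


def centroid_probe(train, test):
    """Fit class centroids on (vector, label) pairs; return the test error rate."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(vec):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
        )

    errors = sum(predict(vec) != label for vec, label in test)
    return errors / len(test)


def fake_layer_embeddings(rng, separation, n_per_class=50, dim=4):
    """Synthetic stand-in for one layer's embeddings of two speakers."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            vec = [rng.gauss(label * separation, 1.0) for _ in range(dim)]
            data.append((vec, label))
    rng.shuffle(data)
    return data


rng = random.Random(0)
layer_errors = []
# Assumption: each "layer" separates the two speakers a little more than the last.
for separation in (0.0, 1.0, 4.0):
    data = fake_layer_embeddings(rng, separation)
    train, test = data[:70], data[70:]
    layer_errors.append(centroid_probe(train, test))
```

Repeating the same loop on test utterances split by speaker group would give the per-group error curves that the bias analysis compares.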
If this is right
- Speaker-group bias in these models is fixed early and originates in pretraining rather than task-specific adaptation.
- For automatic speech recognition the layers with best overall accuracy are also the most biased, unlike the speaker-identification case.
- Fine-tuning for ASR leaves the layerwise bias pattern intact, so post-hoc adaptation does not equalize performance across groups.
- Early layers already encode the differential treatment of speaker groups, so fairness interventions must address representation formation at the start of the network.
Where Pith is reading between the lines
- Pretraining objectives or data composition may need direct inspection to reduce speaker bias before any downstream use.
- Practitioners could select different layers depending on whether the goal is identification accuracy or recognition fairness.
- The same early-bias phenomenon may appear in other self-supervised audio or multimodal models trained with similar contrastive or predictive losses.
Load-bearing premise
That the chosen probing classifiers and bias metrics for SID and ASR faithfully capture embedding-level bias without introducing their own artifacts, and that the selected speaker groups and datasets are representative enough for the observed patterns to generalize.
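To make the notion of a bias metric concrete, here is a minimal sketch of two candidate measures computed from per-utterance probe outcomes: the absolute error gap between the worst- and best-served group, and each group's error relative to the overall error. Both definitions are assumptions for illustration; the page does not state the paper's exact formula, and the group names and counts are invented.

```python
from collections import defaultdict


def group_error_rates(records):
    """records: iterable of (group, is_correct). Returns {group: error rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        if not is_correct:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}


def error_gap(rates):
    """Absolute gap between the worst- and best-served group."""
    return max(rates.values()) - min(rates.values())


def relative_error(rates, overall):
    """Each group's error rate divided by the overall error rate."""
    return {g: r / overall for g, r in rates.items()}


# Hypothetical outcomes: 10 utterances per group (invented numbers).
records = (
    [("native", True)] * 9 + [("native", False)] * 1
    + [("non_native", True)] * 6 + [("non_native", False)] * 4
)
rates = group_error_rates(records)
overall = sum(not ok for _, ok in records) / len(records)
gap = error_gap(rates)
rel = relative_error(rates, overall)
```

A relative error above 1 marks a group served worse than average. Whether such a number reflects the embeddings or the probe itself is exactly what the premise above takes for granted.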
What would settle it
Demonstrating that ASR bias in the lowest-error layers drops after fine-tuning, or that the first latent layers show no measurable speaker-group bias, would falsify the central claims.
Original abstract
Speech encoder models are known to model members of some speaker groups (SGs) better than others. However, there has been little work in establishing why this occurs on a technological level. To our knowledge, we present the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing each embedding layer for speaker identification (SID) and automatic speech recognition (ASR). We find S3Ms produce embeddings biased against certain SGs for both tasks, starting at the very first latent layers. Furthermore, we find opposite patterns of layerwise bias for SID vs ASR for all models in our study: SID bias is minimized in layers that minimize overall SID error; on the other hand, ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR is unaffected when probing S3Ms that are finetuned for ASR, suggesting SG-level bias is established during pretraining and is difficult to remove.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing embeddings from each layer for speaker identification (SID) and automatic speech recognition (ASR) tasks across multiple models. It claims that S3Ms produce embeddings biased against certain speaker groups (SGs) starting at the earliest latent layers, with opposite layerwise patterns: SID bias is minimized in layers that minimize overall SID error, while ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR persists even when probing ASR-finetuned S3Ms, suggesting that SG-level bias originates during pretraining and resists removal via finetuning.
Significance. If the empirical patterns hold under rigorous validation, the work would demonstrate that speaker-group biases in S3Ms are entrenched early in the representation hierarchy and difficult to mitigate through standard task-specific finetuning. This has direct implications for building equitable speech systems and points toward the need for pretraining-stage interventions. The systematic cross-model, cross-task layerwise design is a strength, providing falsifiable observations that could inform both theory and practice in self-supervised speech modeling.
major comments (3)
- [Section 4] Section 4 (Probing Methodology): The central claims on layerwise bias emergence, opposite SID/ASR patterns, and persistence after finetuning depend on the assumption that the chosen probing classifiers (likely linear or low-capacity) and bias metrics (accuracy disparity or equivalent) faithfully isolate intrinsic embedding-level bias. Without explicit controls for class imbalance in SG distributions, non-linear group differences, or probe capacity ablations, the reported minimization/maximization points could arise from probe training dynamics rather than S3M properties, directly undermining the finetuning-invariance conclusion.
- [Section 5.2] Section 5.2 and Figure 3: The identification of layers that 'minimize overall SID error' or 'maximize ASR bias' lacks reported error bars, confidence intervals, or statistical tests comparing adjacent layers. This makes it impossible to determine whether the opposite bias/error relationships are robust or sensitive to post-hoc layer selection and dataset-specific SG distributions.
- [Section 5.3] Section 5.3 (Finetuning Experiments): The claim that the inverse ASR bias/error relationship is unaffected by ASR finetuning requires confirmation that identical probe architectures and training protocols were used pre- and post-finetuning; any mismatch in probe capacity could artifactually preserve the pattern without reflecting true embedding bias.
minor comments (3)
- [Figure 2] Figure 2: The dual y-axes for bias and error curves are not clearly labeled or color-coded, making it difficult to visually confirm the claimed opposite patterns across layers.
- [Abstract] Abstract and Section 2: The term 'S3Ms' is used before its parenthetical expansion is fully contextualized for readers unfamiliar with the abbreviation.
- [Table 1] Table 1: Dataset and SG statistics should include explicit counts per speaker group to allow assessment of potential imbalance effects on the bias metrics.
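The imbalance concern in these comments can be made concrete with a small sketch comparing plain accuracy to balanced accuracy (the mean of per-class recall), which the rebuttal names as the metric used. The labels and the degenerate majority-class probe are invented for illustration, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 90 utterances of class "a", 10 of class "b".
y_true = ["a"] * 90 + ["b"] * 10
y_pred = ["a"] * 100  # a degenerate probe that always predicts the majority class

plain = accuracy(y_true, y_pred)              # looks strong despite ignoring "b"
balanced = balanced_accuracy(y_true, y_pred)  # exposes the failure on "b"
```

Explicit per-group counts, as requested for Table 1, are what let a reader judge how far these two numbers could diverge.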
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of our probing methodology and experimental reporting that we will address to strengthen the manuscript.
- Referee: [Section 4] Section 4 (Probing Methodology): The central claims on layerwise bias emergence, opposite SID/ASR patterns, and persistence after finetuning depend on the assumption that the chosen probing classifiers (likely linear or low-capacity) and bias metrics (accuracy disparity or equivalent) faithfully isolate intrinsic embedding-level bias. Without explicit controls for class imbalance in SG distributions, non-linear group differences, or probe capacity ablations, the reported minimization/maximization points could arise from probe training dynamics rather than S3M properties, directly undermining the finetuning-invariance conclusion.
Authors: We employed linear probes following established practices in representation probing for speech models, as these assess the linear separability of speaker-group information in the embeddings. Class imbalance was addressed via balanced accuracy metrics and stratified sampling during probe training. We agree that additional ablations varying probe capacity (e.g., comparing linear vs. small MLP probes) would provide stronger evidence that the patterns originate from the S3M embeddings rather than probe dynamics. We will incorporate these ablations into the revised Section 4.
Revision: yes
- Referee: [Section 5.2] Section 5.2 and Figure 3: The identification of layers that 'minimize overall SID error' or 'maximize ASR bias' lacks reported error bars, confidence intervals, or statistical tests comparing adjacent layers. This makes it impossible to determine whether the opposite bias/error relationships are robust or sensitive to post-hoc layer selection and dataset-specific SG distributions.
Authors: We reported average performance across multiple random seeds in the original submission but omitted error bars to maintain figure clarity. We will revise Figure 3 to include error bars and 95% confidence intervals, and add statistical comparisons (e.g., paired t-tests) between adjacent layers in Section 5.2 to confirm that the identified minima/maxima are robust to sampling variation and not artifacts of specific dataset splits.
Revision: yes
- Referee: [Section 5.3] Section 5.3 (Finetuning Experiments): The claim that the inverse ASR bias/error relationship is unaffected by ASR finetuning requires confirmation that identical probe architectures and training protocols were used pre- and post-finetuning; any mismatch in probe capacity could artifactually preserve the pattern without reflecting true embedding bias.
Authors: Identical linear probe architectures, optimization settings, and evaluation protocols were used for both pretrained and finetuned models, as described in the experimental setup. We will expand the text in Section 5.3 to explicitly restate this consistency and reference the shared hyperparameters, removing any potential ambiguity.
Revision: partial
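The responses above promise error bars, 95% confidence intervals, and paired comparisons between adjacent layers. As a hedged sketch of what such a comparison could look like, the code below uses a paired percentile bootstrap over random seeds, which is swapped in here for the paired t-test the authors mention because it needs only the standard library. The per-seed error values are invented, and `paired_bootstrap_ci` is a hypothetical name.

```python
import random
from statistics import mean


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a - b), resampling paired seeds jointly."""
    if len(a) != len(b):
        raise ValueError("a and b must be paired, one value per seed")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(mean(sample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical probe error per seed for two adjacent layers (invented numbers).
layer_k = [0.30, 0.31, 0.29, 0.32, 0.30]
layer_k_plus_1 = [0.20, 0.22, 0.19, 0.21, 0.20]

ci_low, ci_high = paired_bootstrap_ci(layer_k, layer_k_plus_1)
# An interval that excludes 0 suggests the adjacent layers really differ.
differs = ci_low > 0 or ci_high < 0
```

With only a handful of seeds the interval is coarse, so it is better read as a sanity check on the reported minima and maxima than as a formal test.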
Circularity Check
No circularity: purely empirical layerwise probing with independent measurements
full rationale
The paper performs empirical analysis by training probing classifiers on each layer of pretrained S3Ms and computing bias metrics (accuracy disparity across speaker groups) for SID and ASR tasks. No equations, derivations, or fitted parameters are defined such that the reported layerwise bias patterns or inverse bias/error relationships reduce to those inputs by construction. The central claims rest on direct observations from the probes rather than any self-referential fitting or self-citation chain. Self-citations, if present for model details or prior probing methods, are not load-bearing for the fairness findings, which remain falsifiable via the chosen datasets and metrics. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
free parameters (2)
- Speaker group definitions
- Bias metric and probe architecture
axioms (2)
- domain assumption: Probing classifiers extract bias information present in the embeddings without introducing substantial new bias
- domain assumption: The studied models and datasets are representative of current S3Ms
Reference graph
Works this paper leans on
- [1] Introduction. Self-supervised speech processing models (S3Ms) have been repeatedly shown to perform better for certain Speaker Groups (SGs) than others on tasks such as automatic speech recognition (ASR) [1, 2, 3] and speaker identification (SID) [4]. Despite many attempts at reducing this fairness gap [5, 6, 7, 8, 9, 2], none has come near to closing it; ...
- [2] Background. Self-supervised learning (SSL). Two popular techniques for training self-attention based speech processing models are 1) SSL + task-specific finetuning, and 2) task-specific end-to-end training [12]. The first method trains on a large unlabeled audio corpus using SSL objectives like contrastive loss, followed by adaptation to specific downstrea...
- [3] Methodology. 3.1. Lightweight SID and ASR decoders at every layer. We seek to measure each layer’s ability to model both SID and ASR, both overall and relatively for each SG. To measure this, we train lightweight decoders for both ASR and SID based on embeddings from each layer of each encoder model. Following the SUPERB framework, our decoders are comprise...
- [4] Probing results. Due to space constraints, we focus on a representative subset of S3Ms tested on Sonos; we integrate all S3Ms and Fair-speech in Fig. 5. 4.1. Layerwise SG-level bias for SID and ASR. Figs. 1-4 depict relative error rates in Sonos for is native, age, gender, and dialect respectively, for both SID and ASR probes. We obtain overall layerwise erro...
- [5] Conclusion and Future Work. In this paper we probed speech encoder models at each layer for two complementary downstream tasks, ASR and SID. Regarding RQ 1, we find the very first layers of each S3M produce embeddings biased against certain SGs for both tasks. This implies that unfairness is rooted deeply within conventionally pretrained S3Ms. Regardi...
- [6] A. Rai, S. Rahangdale, U. Anand, and A. Mukherjee, “ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems,” May 2025.
- [7] J. Kim, J. Yu, M. Kwon, and J. Kim, “FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition,” Jun. 2025.
- [8] S. Bhattacharjee, J. Mishra, H. S. Shekhawat, and S. R. M. Prasanna, “Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech,” May 2025.
- [9] W. T. Hutiri and A. Ding, “Bias in Automated Speaker Recognition,” in 2022 ACM Conference on Fairness Accountability and Transparency, Jun. 2022, pp. 230–247.
- [10] N. Das, S. Bodapati, M. Sunkara, S. Srinivasan, and D. H. Chau, “Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning,” Mar. 2021.
- [11] J. Li, V. Manohar, P. Chitkara, A. Tjandra, M. Picheny, F. Zhang, X. Zhang, and Y. Saraf, “Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings,” Oct. 2021.
- [12] H. ElGhazaly, B. Mirheidari, N. S. Moosavi, and H. Christensen, “Exploring Gender Disparities in Automatic Speech Recognition Technology,” Feb. 2025.
- [13] M. Garnerin, S. Rossato, and L. Besacier, “Investigating the Impact of Gender Representation in ASR Training Data: A Case Study on Librispeech,” in Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing, M. Costa-jussa, H. Gonen, C. Hardmeier, and K. Webster, Eds. Online: Association for Computational Linguistics, Aug. 2021, pp. 86–92.
- [14] G. Fenu, G. Medda, M. Marras, and G. Meloni, “Improving Fairness in Speaker Recognition,” in Proceedings of the 2020 European Symposium on Software Engineering. Rome Italy: ACM, Nov. 2020, pp. 129–136.
- [15] C. Sekkat, F. Leroy, S. Mdhaffar, B. P. Smith, Y. Estève, J. Dureau, and A. Coucke, “Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzola...
- [16] R. Peri, K. Somandepalli, and S. Narayanan, “A study of bias mitigation strategies for speaker recognition,” Comput. Speech Lang., vol. 79, no. C, Apr. 2023.
- [17] S. Feng, B. M. Halpern, O. Kudina, and O. Scharenborg, “Towards inclusive automatic speech recognition,” Computer Speech & Language, vol. 84, p. 101567, Mar. 2024.
- [18] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Oct. 2020.
- [19] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” Jun. 2021.
- [20] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” Dec. 2022.
- [21] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, Oct. 2022.
- [22] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech processing Universal PERformance Benchmark,” Oct. 2021.
- [23] H.-J. Na and J.-S. Park, “Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks,” Applied Sciences, vol. 11, p. 8412, Sep. 2021.
- [24] A. Jain, M. Upreti, and P. Jyothi, “Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning,” in Interspeech 2018. ISCA, Sep. 2018, pp. 2454–2458.
- [25] P. Dheram, M. Ramakrishnan, A. Raju, I.-F. Chen, B. King, K. Powell, M. Saboowala, K. Shetty, and A. Stolcke, “Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities,” in Interspeech 2022, Sep. 2022, pp. 1268–1272.
- [26] H. ElGhazaly, B. Mirheidari, H. Christensen, and N. S. Moosavi, “Fairness in Automatic Speech Recognition Isn’t a One-Size-Fits-All,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 19169–19178.
- [27] L. Alonzo-Canul, B. Lecouteux, and F. Portet, “Vers l’apprentissage de modèles auto-supervisés de reconnaissance automatique de la parole plus équitables sans a priori démographique,” 2025.
- [28] L. Maison and Y. Estève, “Some Voices are Too Common: Building Fair Speech Recognition Systems Using the CommonVoice Dataset,” in INTERSPEECH 2023. ISCA, Aug. 2023, pp. 4428–4432.
- [29] W. Zhou, H. Wu, J. Xu, M. Zeineldeen, C. Lüscher, R. Schlüter, and H. Ney, “Enhancing and Adversarial: Improve ASR with Speaker Labels,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2023, pp. 1–5.
- [30] T. Tanaka, R. Masumura, H. Sato, M. Ihori, K. Matsuura, T. Ashihara, and T. Moriya, “Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks,” in Interspeech 2022. ISCA, Sep. 2022, pp. 1066–1070.
- [31] M. Masson and J. Carson-berndsen, “Investigating Phoneme Similarity with Artificially Accented Speech,” in Proceedings of the 20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, G. Nicolai, E. Chodroff, F. Mailhot, and Ç. Çöltekin, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 49–57.
- [32] S. Feng, O. Kudina, B. M. Halpern, and O. Scharenborg, “Quantifying Bias in Automatic Speech Recognition,” Apr. 2021.
- [33] I. Slaughter, C. Greenberg, R. Schwartz, and A. Caliskan, “Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition,” Oct. 2023.
- [34] Y. Zhang, H. Fournier, R. Kalitvianski, M. Dinarelli, and F. Ringeval, “Beyond Static Emotions: Leveraging Multitask Learning to Model Dynamics of Dimensional Affect in Speech,” in Text, Speech, and Dialogue, K. Ekštein, M. Konopík, O. Pražák, and F. Pártl, Eds. Cham: Springer Nature Switzerland, 2026, pp. 109–120.
- [35] A. A and U. S, “Joint Encoder-Decoder Self-Supervised Pre-training for ASR,” Jun. 2022.
- [36] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A General-Purpose Speech Toolkit,” Jun. 2021.
- [37] I.-E. Veliche, Z. Huang, V. A. Kochaniyan, F. Peng, O. Kalinli, and M. L. Seltzer, “Towards measuring fairness in speech recognition: Fair-Speech dataset,” Aug. 2024.
- [38] S. Verma and J. Rubin, “Fairness definitions explained,” in Proceedings of the International Workshop on Software Fairness, ser. FairWare ’18. New York, NY, USA: Association for Computing Machinery, May 2018, pp. 1–7.
- [39] G. Attanasio, B. Savoldi, D. Fucci, and D. Hovy, “Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps,” Oct. 2024.
- [40] F. Barsotti and R. G. Koçer, “MinMax fairness: From Rawlsian Theory of Justice to solution for algorithmic bias,” AI & SOCIETY, vol. 39, no. 3, pp. 961–974, Jun. 2024.
- [41] C.-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised Learning with Random-projection Quantizer for Speech Recognition,” Jun. 2022.
- [42] R. Whetten, T. Parcollet, M. Dinarelli, and Y. Estève, “Open Implementation and Study of BEST-RQ for Speech Processing,” Sep. 2024.
- [43] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” Dec. 2021.
- [44] T. Parcollet, H. Nguyen, S. Evain, M. Z. Boito, A. Pupier, S. Mdhaffar, H. Le, S. Alisamir, N. Tomashenko, M. Dinarelli, S. Zhang, A. Allauzen, M. Coavoux, Y. Esteve, M. Rouvier, J. Goulian, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier, “LeBenchmark 2.0: A Standardized, Replicable and Enhanced Framework for Self-supervised...
- [45] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux, “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (...
- [46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210.
- [47] M. Garnerin, S. Rossato, and L. Besacier, “Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance,” Aug. 2019.
- [48] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 5329–5333.
- [49] F. Herron, S. Rossato, A. Allauzen, B. Favre, and F. Portet, “Speaker Group Encoding in Self-supervised Speech Recognition Models,” in Text, Speech, and Dialogue, K. Ekštein, M. Konopík, O. Pražák, and F. Pártl, Eds. Cham: Springer Nature Switzerland, 2025, pp. 121–132.
- [50] A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise Analysis of a Self-supervised Speech Representation Model,” Dec. 2022.
- [51] Y. Zhang, Y. Zhang, B. Halpern, T. Patel, and O. Scharenborg, “Mitigating bias against non-native accents,” in Proc. Interspeech 2022, 2022, pp. 3168–3172.
- [52] J. Meyer, L. Rauchenstein, J. D. Eisenberg, and N. Howell, “Artie Bias Corpus: An Open Dataset for Detecting Demographic Bias in Speech Applications,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, ...
- [53] A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel, “Racial disparities in automated speech recognition,” Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, Apr. 2020.
- [54] Y. Meng, Y.-H. Chou, A. T. Liu, and H.-y. Lee, “Don’t speak too fast: The impact of data bias on self-supervised speech models,” Apr. 2022.