arxiv: 2604.18249 · v1 · submitted 2026-04-20 · 💻 cs.CL

Where Do Self-Supervised Speech Models Become Unfair?

Pith reviewed 2026-05-10 04:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-supervised speech models · speaker bias · layerwise analysis · speaker identification · automatic speech recognition · fairness · pretraining

The pith

Self-supervised speech models embed biases against certain speaker groups from their first layers, with bias patterns that invert between speaker identification and speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates where unfairness emerges inside pretrained self-supervised speech encoder models by checking every embedding layer for two downstream tasks. It finds that bias against particular speaker groups appears immediately in the earliest latent layers for both speaker identification and automatic speech recognition. The layerwise pattern is reversed across tasks: layers that minimize overall error for speaker identification also minimize bias, whereas layers that minimize overall error for speech recognition maximize bias. This speech-recognition bias pattern stays unchanged even after the models are fine-tuned on the recognition task, pointing to an origin in the original pretraining stage.

Core claim

Self-supervised speech encoder models produce embeddings biased against certain speaker groups for both speaker identification and automatic speech recognition tasks, starting at the very first latent layers. SID bias is minimized in layers that minimize overall SID error, while ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR remains when probing models that have been fine-tuned for ASR, indicating that speaker-group bias is established during pretraining and resists removal by later adaptation.
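
The opposite bias/error relationships described above can be checked numerically once per-layer error and bias values are available. The sketch below is illustrative only: the `pearson` helper and all four lists of per-layer values are made up for this example and are not taken from the paper. A positive correlation between error and bias corresponds to the SID pattern; a negative one corresponds to the ASR pattern.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Hypothetical per-layer values (NOT results from the paper).
sid_error = [0.40, 0.30, 0.20, 0.25]
sid_bias = [0.12, 0.09, 0.05, 0.07]
asr_error = [0.50, 0.35, 0.20, 0.30]
asr_bias = [0.05, 0.09, 0.15, 0.10]

sid_r = pearson(sid_error, sid_bias)  # positive: low error goes with low bias
asr_r = pearson(asr_error, asr_bias)  # negative: low error goes with high bias
```

A rank correlation would be a reasonable alternative if the relationship is monotonic but not linear.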

What carries the argument

Layer-by-layer probing of embeddings from self-supervised speech models using separate classifiers for speaker identification and automatic speech recognition to track bias magnitude against speaker groups.
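
A minimal sketch of such a layer-by-layer probe loop follows, for a classification task such as SID. It assumes embeddings have already been extracted per layer, and it uses a nearest-centroid classifier as a stand-in probe; this is not the decoder used in the paper, and every function name here is hypothetical.

```python
from collections import defaultdict


def fit_centroids(embeddings, labels):
    """Average the training embeddings of each label into one centroid."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((c - v) ** 2 for c, v in zip(centroids[lab], vec)),
    )


def probe_layer(train, test):
    """train/test are lists of (embedding, label, speaker_group) tuples.

    Returns (overall_error, {speaker_group: error}) on the test items.
    """
    centroids = fit_centroids([e for e, _, _ in train], [l for _, l, _ in train])
    mistakes = defaultdict(list)
    for vec, label, group in test:
        mistakes[group].append(predict(centroids, vec) != label)
    flat = [m for ms in mistakes.values() for m in ms]
    overall = sum(flat) / len(flat)
    per_group = {g: sum(ms) / len(ms) for g, ms in mistakes.items()}
    return overall, per_group


def probe_all_layers(layers):
    """layers holds one (train, test) pair per encoder layer, in layer order."""
    return [probe_layer(train, test) for train, test in layers]
```

The key design point is that the same probe recipe is fitted independently at every layer, so differences between layers can be attributed to the embeddings and not to the probe.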

If this is right

  • Speaker-group bias in these models is fixed early and originates in pretraining rather than task-specific adaptation.
  • For automatic speech recognition the layers with best overall accuracy are also the most biased, unlike the speaker-identification case.
  • Fine-tuning for ASR leaves the layerwise bias pattern intact, so post-hoc adaptation does not equalize performance across groups.
  • Early layers already encode the differential treatment of speaker groups, so fairness interventions must address representation formation at the start of the network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pretraining objectives or data composition may need direct inspection to reduce speaker bias before any downstream use.
  • Practitioners could select different layers depending on whether the goal is identification accuracy or recognition fairness.
  • The same early-bias phenomenon may appear in other self-supervised audio or multimodal models trained with similar contrastive or predictive losses.
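
The layer-selection idea in the second bullet could be sketched as a constrained choice: take the lowest-error layer among those whose bias stays under a cap. The function name, the cap, and the per-layer numbers below are all hypothetical and are not drawn from the paper.

```python
def pick_layer(errors, biases, max_bias=None):
    """Index of the lowest-error layer whose bias does not exceed max_bias.

    With max_bias=None the choice is driven by error alone.
    """
    if len(errors) != len(biases):
        raise ValueError("errors and biases must have one entry per layer")
    candidates = [
        i for i, b in enumerate(biases) if max_bias is None or b <= max_bias
    ]
    if not candidates:
        raise ValueError("no layer satisfies the bias cap")
    return min(candidates, key=lambda i: errors[i])


# Hypothetical ASR-like profile: the most accurate layer is also the most biased.
errors = [0.50, 0.35, 0.20, 0.30]
biases = [0.05, 0.09, 0.15, 0.10]

best_by_error = pick_layer(errors, biases)               # error only
best_under_cap = pick_layer(errors, biases, max_bias=0.10)  # error under a bias cap
```

Under this toy profile the two criteria pick different layers, which is the trade-off the bullet points at.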

Load-bearing premise

That the chosen probing classifiers and bias metrics for SID and ASR faithfully capture embedding-level bias without introducing their own artifacts, and that the selected speaker groups and datasets are representative enough for the observed patterns to generalize.

What would settle it

Demonstrating that ASR bias in the lowest-error layers drops after fine-tuning, or that the first latent layers show no measurable speaker-group bias, would falsify the central claims.

Figures

Figures reproduced from arXiv: 2604.18249 by Alexandre Allauzen, Felix Herron, François Portet, Maja Hjuler, Solange Rossato.

Figure 1: Layerwise evolution of relative error rate (see Eq. 1) for native/non-native speakers (solid lines) vs overall error (dotted). Relative error > 0 implies an error rate higher than average, i.e. worse performance. Non-native speakers have increasingly worse relative performance from layer to layer for ASR, though near-equal performance for SID. This is in contrast to by-age (see …
Figure 2: Layerwise evolution of relative error for different ages (solid lines) vs overall error (dotted). Children (9-16) are worst modeled while older adults (42+) are best modeled, for both SID and ASR. Like …
Figure 4: Layerwise evolution of relative dialect error (solid lines) vs overall error (dotted) for American native-English speakers
Figure 5: Each dot represents one layer for each pretrained S3M, plotted according to its bias and overall error rate. Layers with low bias and low error will be at the bottom left of each plot; layers with high bias and low error will be at the bottom right of each plot. For SID, layers with low overall error also have low SG-level bias; for ASR, layers with low overall error have high SG-level bias. ASR error rate…
Figure 6: Relative overall WER (dots) and SG-level bias (Eq. 2, dot-dash) for ASR finetuned models relative to pretrained S3Ms on Sonos (see Eq. 1). Values < 0 mean lower WER (dots) or less bias (dot-dash) than the pretrained S3M, respectively.
4.2. Effect of finetuning. We repeat our experiments on S3Ms that have been finetuned for ASR, both using the vanilla CTC algorithm as well as using the fairness-enhancing CTC…
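
The captions refer to a relative error rate (Eq. 1) that is not reproduced on this page. One form consistent with the caption text ("relative error > 0 implies an error rate higher than average") is the group error minus the overall error, divided by the overall error. The sketch below uses that form as an assumption; the paper's exact definition may differ, and the function names are hypothetical.

```python
def relative_error(group_error, overall_error):
    """Assumed form: (group - overall) / overall. Positive means worse than average."""
    if overall_error <= 0:
        raise ValueError("overall error must be positive")
    return (group_error - overall_error) / overall_error


def layerwise_relative_error(group_errors, overall_errors):
    """Apply relative_error layer by layer to two per-layer sequences."""
    if len(group_errors) != len(overall_errors):
        raise ValueError("need one group error and one overall error per layer")
    return [relative_error(g, o) for g, o in zip(group_errors, overall_errors)]
```

For example, a group error of 0.3 against an overall error of 0.2 gives a relative error of about 0.5, i.e. that group fares roughly 50% worse than average at that layer.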
Original abstract

Speech encoder models are known to model members of some speaker groups (SGs) better than others. However, there has been little work in establishing why this occurs on a technological level. To our knowledge, we present the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing each embedding layer for speaker identification (SID) and automatic speech recognition (ASR). We find S3Ms produce embeddings biased against certain SGs for both tasks, starting at the very first latent layers. Furthermore, we find opposite patterns of layerwise bias for SID vs ASR for all models in our study: SID bias is minimized in layers that minimize overall SID error; on the other hand, ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR is unaffected when probing S3Ms that are finetuned for ASR, suggesting SG-level bias is established during pretraining and is difficult to remove.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing embeddings from each layer for speaker identification (SID) and automatic speech recognition (ASR) tasks across multiple models. It claims that S3Ms produce embeddings biased against certain speaker groups (SGs) starting at the earliest latent layers, with opposite layerwise patterns: SID bias is minimized in layers that minimize overall SID error, while ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR persists even when probing ASR-finetuned S3Ms, suggesting that SG-level bias originates during pretraining and resists removal via finetuning.

Significance. If the empirical patterns hold under rigorous validation, the work would demonstrate that speaker-group biases in S3Ms are entrenched early in the representation hierarchy and difficult to mitigate through standard task-specific finetuning. This has direct implications for building equitable speech systems and points toward the need for pretraining-stage interventions. The systematic cross-model, cross-task layerwise design is a strength, providing falsifiable observations that could inform both theory and practice in self-supervised speech modeling.

major comments (3)
  1. [Section 4] Section 4 (Probing Methodology): The central claims on layerwise bias emergence, opposite SID/ASR patterns, and persistence after finetuning depend on the assumption that the chosen probing classifiers (likely linear or low-capacity) and bias metrics (accuracy disparity or equivalent) faithfully isolate intrinsic embedding-level bias. Without explicit controls for class imbalance in SG distributions, non-linear group differences, or probe capacity ablations, the reported minimization/maximization points could arise from probe training dynamics rather than S3M properties, directly undermining the finetuning-invariance conclusion.
  2. [Section 5.2] Section 5.2 and Figure 3: The identification of layers that 'minimize overall SID error' or 'maximize ASR bias' lacks reported error bars, confidence intervals, or statistical tests comparing adjacent layers. This makes it impossible to determine whether the opposite bias/error relationships are robust or sensitive to post-hoc layer selection and dataset-specific SG distributions.
  3. [Section 5.3] Section 5.3 (Finetuning Experiments): The claim that the inverse ASR bias/error relationship is unaffected by ASR finetuning requires confirmation that identical probe architectures and training protocols were used pre- and post-finetuning; any mismatch in probe capacity could artifactually preserve the pattern without reflecting true embedding bias.
minor comments (3)
  1. [Figure 2] Figure 2: The dual y-axes for bias and error curves are not clearly labeled or color-coded, making it difficult to visually confirm the claimed opposite patterns across layers.
  2. [Abstract] Abstract and Section 2: The term 'S3Ms' is used before its parenthetical expansion is fully contextualized for readers unfamiliar with the abbreviation.
  3. [Table 1] Table 1: Dataset and SG statistics should include explicit counts per speaker group to allow assessment of potential imbalance effects on the bias metrics.
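
The imbalance concern in major comment 1 and minor comment 3 can be made concrete with a small sketch: report per-group counts, and compare plain accuracy with an accuracy averaged over groups so that each group counts equally. The function names and the toy data are hypothetical and are not the paper's evaluation code.

```python
from collections import Counter, defaultdict


def group_counts(groups):
    """Number of evaluation items per speaker group."""
    return Counter(groups)


def micro_accuracy(correct):
    """Plain accuracy: every item counts equally, so large groups dominate."""
    return sum(correct) / len(correct)


def group_macro_accuracy(correct, groups):
    """Mean of per-group accuracies: every group counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ok, group in zip(correct, groups):
        totals[group] += 1
        hits[group] += bool(ok)
    return sum(hits[g] / totals[g] for g in totals) / len(totals)


# Toy example: three adult items (all correct) and one child item (wrong).
correct = [True, True, True, False]
groups = ["adult", "adult", "adult", "child"]
```

In the toy data the plain accuracy looks acceptable while the group-averaged accuracy exposes that one group is never classified correctly.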

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of our probing methodology and experimental reporting that we will address to strengthen the manuscript.

  1. Referee: [Section 4] Section 4 (Probing Methodology): The central claims on layerwise bias emergence, opposite SID/ASR patterns, and persistence after finetuning depend on the assumption that the chosen probing classifiers (likely linear or low-capacity) and bias metrics (accuracy disparity or equivalent) faithfully isolate intrinsic embedding-level bias. Without explicit controls for class imbalance in SG distributions, non-linear group differences, or probe capacity ablations, the reported minimization/maximization points could arise from probe training dynamics rather than S3M properties, directly undermining the finetuning-invariance conclusion.

    Authors: We employed linear probes following established practices in representation probing for speech models, as these assess the linear separability of speaker-group information in the embeddings. Class imbalance was addressed via balanced accuracy metrics and stratified sampling during probe training. We agree that additional ablations varying probe capacity (e.g., comparing linear vs. small MLP probes) would provide stronger evidence that the patterns originate from the S3M embeddings rather than probe dynamics. We will incorporate these ablations into the revised Section 4. revision: yes

  2. Referee: [Section 5.2] Section 5.2 and Figure 3: The identification of layers that 'minimize overall SID error' or 'maximize ASR bias' lacks reported error bars, confidence intervals, or statistical tests comparing adjacent layers. This makes it impossible to determine whether the opposite bias/error relationships are robust or sensitive to post-hoc layer selection and dataset-specific SG distributions.

    Authors: We reported average performance across multiple random seeds in the original submission but omitted error bars to maintain figure clarity. We will revise Figure 3 to include error bars and 95% confidence intervals, and add statistical comparisons (e.g., paired t-tests) between adjacent layers in Section 5.2 to confirm that the identified minima/maxima are robust to sampling variation and not artifacts of specific dataset splits. revision: yes

  3. Referee: [Section 5.3] Section 5.3 (Finetuning Experiments): The claim that the inverse ASR bias/error relationship is unaffected by ASR finetuning requires confirmation that identical probe architectures and training protocols were used pre- and post-finetuning; any mismatch in probe capacity could artifactually preserve the pattern without reflecting true embedding bias.

    Authors: Identical linear probe architectures, optimization settings, and evaluation protocols were used for both pretrained and finetuned models, as described in the experimental setup. We will expand the text in Section 5.3 to explicitly restate this consistency and reference the shared hyperparameters, removing any potential ambiguity. revision: partial
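
The adjacent-layer comparison and 95% confidence intervals mentioned in the second response could be computed along the following lines. This is a sketch under stated assumptions: per-seed error rates are paired by seed, the critical value is supplied by the caller, and the numbers are made up for illustration and not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(errors_a, errors_b, t_crit):
    """Compare two layers using per-seed error rates paired by seed.

    Returns (mean_diff, t_statistic, (ci_low, ci_high)) for errors_a - errors_b.
    t_crit is the two-sided critical value of Student's t for
    len(errors_a) - 1 degrees of freedom; the caller supplies it.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    m = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    if se == 0:
        raise ValueError("all paired differences are identical")
    return m, m / se, (m - t_crit * se, m + t_crit * se)


# Hypothetical error rates for five seeds at two adjacent layers.
layer_k = [0.210, 0.205, 0.215, 0.208, 0.212]
layer_k_plus_1 = [0.190, 0.188, 0.197, 0.186, 0.194]

T_CRIT_DF4 = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom
mean_diff, t_stat, ci = paired_summary(layer_k, layer_k_plus_1, T_CRIT_DF4)
significant = abs(t_stat) > T_CRIT_DF4
```

When many adjacent-layer pairs are tested, a multiple-comparison correction would be advisable before reading individual minima or maxima as robust.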

Circularity Check

0 steps flagged

No circularity: purely empirical layerwise probing with independent measurements

The paper performs empirical analysis by training probing classifiers on each layer of pretrained S3Ms and computing bias metrics (accuracy disparity across speaker groups) for SID and ASR tasks. No equations, derivations, or fitted parameters are defined such that the reported layerwise bias patterns or inverse bias/error relationships reduce to those inputs by construction. The central claims rest on direct observations from the probes rather than any self-referential fitting or self-citation chain. Self-citations, if present for model details or prior probing methods, are not load-bearing for the fairness findings, which remain falsifiable via the chosen datasets and metrics. This is a standard non-circular empirical study.
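
As one concrete reading of "accuracy disparity across speaker groups", the sketch below takes the gap between the best-served and worst-served group at each layer. This max-min form is an assumption chosen for illustration; the paper's own bias metric (its Eq. 2) is not reproduced on this page and may be defined differently.

```python
def accuracy_disparity(accuracy_by_group):
    """Gap between the highest and lowest per-group accuracy (0 means parity)."""
    values = list(accuracy_by_group.values())
    if not values:
        raise ValueError("need at least one speaker group")
    return max(values) - min(values)


def layerwise_disparity(per_layer_accuracies):
    """Apply accuracy_disparity to one {group: accuracy} dict per layer."""
    return [accuracy_disparity(acc) for acc in per_layer_accuracies]
```

Because the measure is computed directly from per-group probe accuracies, it involves no fitted parameter of its own.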

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper is an empirical measurement study. It relies on standard assumptions about what linear or simple probes reveal about embeddings and on the representativeness of the chosen models and speaker groups. No new physical entities are postulated. Free parameters are limited to experimental design choices such as speaker-group categorization and layer selection.

free parameters (2)
  • Speaker group definitions
    Categorization of speakers into groups (e.g., by accent, gender, age) is chosen by the authors and directly affects measured bias.
  • Bias metric and probe architecture
    The specific way bias is quantified per layer and the form of the probing classifier are design choices that shape the reported patterns.
axioms (2)
  • domain assumption: Probing classifiers extract bias information present in the embeddings without introducing substantial new bias
    The entire layerwise analysis rests on the validity of the SID and ASR probes as faithful readouts of embedding properties.
  • domain assumption: The studied models and datasets are representative of current S3Ms
    Generalization from the specific pretrained models examined to broader claims about S3Ms.

pith-pipeline@v0.9.0 · 5471 in / 1551 out tokens · 31942 ms · 2026-05-10T04:06:28.336073+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

  1. [1]

    Introduction. Self-supervised speech processing models (S3Ms) have been repeatedly shown to perform better for certain Speaker Groups (SGs) than others on tasks such as automatic speech recognition (ASR) [1, 2, 3] and speaker identification (SID) [4]. Despite many attempts at reducing this fairness gap [5, 6, 7, 8, 9, 2], none has come near to closing it; ...

  2. [2]

    Background. Self-supervised learning (SSL). Two popular techniques for training self-attention based speech processing models are 1) SSL + task-specific finetuning, and 2) task-specific end-to-end training [12]. The first method trains on a large unlabeled audio corpus using SSL objectives like contrastive loss, followed by adaptation to specific downstrea...

  3. [3]

    likely speaker-disjoint

    Methodology. 3.1. Lightweight SID and ASR decoders at every layer. We seek to measure each layer’s ability to model both SID and ASR, both overall and relatively for each SG. To measure this, we train lightweight decoders for both ASR and SID based on embeddings from each layer of each encoder model. Following the SUPERB framework, our decoders are comprise...

  4. [4]

    Probing results. Due to space constraints, we focus on a representative subset of S3Ms tested on Sonos; we integrate all S3Ms and Fair-speech in Fig. 5. 4.1. Layerwise SG-level bias for SID and ASR. Figs. 1-4 depict relative error rates in Sonos for is native, age, gender, and dialect respectively, for both SID and ASR probes. We obtain overall layerwise erro...

  5. [5]

    Conclusion and Future Work. In this paper we probed speech encoder models at each layer for two complementary downstream tasks, ASR and SID. Regarding RQ 1, we find the very first layers of each S3M produce embeddings biased against certain SGs for both tasks. This implies that unfairness is rooted deeply within conventionally pretrained S3Ms. Regardi...

  6. [6]

    ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems,

    A. Rai, S. Rahangdale, U. Anand, and A. Mukherjee, “ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems,” May 2025

  7. [7]

    FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition,

    J. Kim, J. Yu, M. Kwon, and J. Kim, “FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition,” Jun. 2025

  8. [8]

    Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech,

    S. Bhattacharjee, J. Mishra, H. S. Shekhawat, and S. R. M. Prasanna, “Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech,” May 2025

  9. [9]

    Bias in Automated Speaker Recognition,

    W. T. Hutiri and A. Ding, “Bias in Automated Speaker Recognition,” in 2022 ACM Conference on Fairness Accountability and Transparency, Jun. 2022, pp. 230–247

  10. [10]

    Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning,

    N. Das, S. Bodapati, M. Sunkara, S. Srinivasan, and D. H. Chau, “Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning,” Mar. 2021

  11. [11]

    Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings,

    J. Li, V. Manohar, P. Chitkara, A. Tjandra, M. Picheny, F. Zhang, X. Zhang, and Y. Saraf, “Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings,” Oct. 2021

  12. [12]

    Exploring Gender Disparities in Automatic Speech Recognition Technology,

    H. ElGhazaly, B. Mirheidari, N. S. Moosavi, and H. Christensen, “Exploring Gender Disparities in Automatic Speech Recognition Technology,” Feb. 2025

  13. [13]

    Investigating the Impact of Gender Representation in ASR Training Data: A Case Study on Librispeech,

    M. Garnerin, S. Rossato, and L. Besacier, “Investigating the Impact of Gender Representation in ASR Training Data: A Case Study on Librispeech,” in Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing, M. Costa-jussa, H. Gonen, C. Hardmeier, and K. Webster, Eds. Online: Association for Computational Linguistics, Aug. 2021, pp. 86–92

  14. [14]

    Improving Fairness in Speaker Recognition,

    G. Fenu, G. Medda, M. Marras, and G. Meloni, “Improving Fairness in Speaker Recognition,” in Proceedings of the 2020 European Symposium on Software Engineering. Rome Italy: ACM, Nov. 2020, pp. 129–136

  15. [15]

    Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants,

    C. Sekkat, F. Leroy, S. Mdhaffar, B. P. Smith, Y. Estève, J. Dureau, and A. Coucke, “Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzola...

  16. [16]

    A study of bias mitigation strategies for speaker recognition,

    R. Peri, K. Somandepalli, and S. Narayanan, “A study of bias mitigation strategies for speaker recognition,” Comput. Speech Lang., vol. 79, no. C, Apr. 2023

  17. [17]

    Towards inclusive automatic speech recognition,

    S. Feng, B. M. Halpern, O. Kudina, and O. Scharenborg, “Towards inclusive automatic speech recognition,” Computer Speech & Language, vol. 84, p. 101567, Mar. 2024

  18. [18]

    Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

    A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Oct. 2020

  19. [19]

    HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” Jun. 2021

  20. [20]

    Robust Speech Recognition via Large-Scale Weak Supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” Dec. 2022

  21. [21]

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, Oct. 2022

  22. [22]

    SUPERB: Speech processing Universal PERformance Benchmark,

    S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech processing Universal PERformance Benchmark,” Oct. 2021

  23. [23]

    Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks,

    H.-J. Na and J.-S. Park, “Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks,” Applied Sciences, vol. 11, p. 8412, Sep. 2021

  24. [24]

    Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning,

    A. Jain, M. Upreti, and P. Jyothi, “Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning,” in Interspeech 2018. ISCA, Sep. 2018, pp. 2454–2458

  25. [25]

    Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities,

    P. Dheram, M. Ramakrishnan, A. Raju, I.-F. Chen, B. King, K. Powell, M. Saboowala, K. Shetty, and A. Stolcke, “Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities,” in Interspeech 2022, Sep. 2022, pp. 1268–1272

  26. [26]

    Fairness in Automatic Speech Recognition Isn’t a One-Size-Fits-All,

    H. ElGhazaly, B. Mirheidari, H. Christensen, and N. S. Moosavi, “Fairness in Automatic Speech Recognition Isn’t a One-Size-Fits-All,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 19169–19178

  27. [27]

    Vers l’apprentissage de modèles auto-supervisés de reconnaissance automatique de la parole plus équitables sans a priori démographique,

    L. Alonzo-Canul, B. Lecouteux, and F. Portet, “Vers l’apprentissage de modèles auto-supervisés de reconnaissance automatique de la parole plus équitables sans a priori démographique,” 2025

  28. [28]

    Some Voices are Too Common: Building Fair Speech Recognition Systems Using the CommonVoice Dataset,

    L. Maison and Y. Estève, “Some Voices are Too Common: Building Fair Speech Recognition Systems Using the CommonVoice Dataset,” in INTERSPEECH 2023. ISCA, Aug. 2023, pp. 4428–4432

  29. [29]

    Enhancing and Adversarial: Improve ASR with Speaker Labels,

    W. Zhou, H. Wu, J. Xu, M. Zeineldeen, C. Lüscher, R. Schlüter, and H. Ney, “Enhancing and Adversarial: Improve ASR with Speaker Labels,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2023, pp. 1–5

  30. [30]

    Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks,

    T. Tanaka, R. Masumura, H. Sato, M. Ihori, K. Matsuura, T. Ashihara, and T. Moriya, “Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks,” in Interspeech 2022. ISCA, Sep. 2022, pp. 1066–1070

  31. [31]

    Investigating Phoneme Similarity with Artificially Accented Speech,

    M. Masson and J. Carson-berndsen, “Investigating Phoneme Similarity with Artificially Accented Speech,” in Proceedings of the 20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, G. Nicolai, E. Chodroff, F. Mailhot, and Ç. Çöltekin, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 49–57

  32. [32]

    Quantifying Bias in Automatic Speech Recognition,

    S. Feng, O. Kudina, B. M. Halpern, and O. Scharenborg, “Quantifying Bias in Automatic Speech Recognition,” Apr. 2021

  33. [33]

    Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition,

    I. Slaughter, C. Greenberg, R. Schwartz, and A. Caliskan, “Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition,” Oct. 2023

  34. [34]

    Beyond Static Emotions: Leveraging Multitask Learning to Model Dynamics of Dimensional Affect in Speech,

    Y. Zhang, H. Fournier, R. Kalitvianski, M. Dinarelli, and F. Ringeval, “Beyond Static Emotions: Leveraging Multitask Learning to Model Dynamics of Dimensional Affect in Speech,” in Text, Speech, and Dialogue, K. Ekštein, M. Konopík, O. Pražák, and F. Pártl, Eds. Cham: Springer Nature Switzerland, 2026, pp. 109–120

  35. [35]

    Joint Encoder-Decoder Self-Supervised Pre-training for ASR,

    A. A and U. S, “Joint Encoder-Decoder Self-Supervised Pre-training for ASR,” Jun. 2022

  36. [36]

    SpeechBrain: A General-Purpose Speech Toolkit,

    M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A General-Purpose Speech Toolkit,” Jun. 2021

  37. [37]

    Towards measuring fairness in speech recognition: Fair-Speech dataset,

    I.-E. Veliche, Z. Huang, V. A. Kochaniyan, F. Peng, O. Kalinli, and M. L. Seltzer, “Towards measuring fairness in speech recognition: Fair-Speech dataset,” Aug. 2024

  38. [38]

    Fairness definitions explained,

    S. Verma and J. Rubin, “Fairness definitions explained,” in Proceedings of the International Workshop on Software Fairness, ser. FairWare ’18. New York, NY, USA: Association for Computing Machinery, May 2018, pp. 1–7

  39. [39]

    Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps,

    G. Attanasio, B. Savoldi, D. Fucci, and D. Hovy, “Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps,” Oct. 2024

  40. [40]

    MinMax fairness: From Rawlsian Theory of Justice to solution for algorithmic bias,

    F. Barsotti and R. G. Koçer, “MinMax fairness: From Rawlsian Theory of Justice to solution for algorithmic bias,” AI & SOCIETY, vol. 39, no. 3, pp. 961–974, Jun. 2024

  41. [41]

    Self-supervised Learning with Random-projection Quantizer for Speech Recognition,

    C.-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised Learning with Random-projection Quantizer for Speech Recognition,” Jun. 2022

  42. [42]

    Open Implementation and Study of BEST-RQ for Speech Processing,

    R. Whetten, T. Parcollet, M. Dinarelli, and Y. Estève, “Open Implementation and Study of BEST-RQ for Speech Processing,” Sep. 2024

  43. [43]

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” Dec. 2021

  44. [44]

    LeBenchmark 2.0: A Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech,

    T. Parcollet, H. Nguyen, S. Evain, M. Z. Boito, A. Pupier, S. Mdhaffar, H. Le, S. Alisamir, N. Tomashenko, M. Dinarelli, S. Zhang, A. Allauzen, M. Coavoux, Y. Esteve, M. Rouvier, J. Goulian, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier, “LeBenchmark 2.0: A Standardized, Replicable and Enhanced Framework for Self-supervised...

  45. [45]

    Libri-Light: A Benchmark for ASR with Limited or No Supervision,

    J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux, “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (...

  46. [46]

    Librispeech: An ASR corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210

  47. [47]

    Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance,

    M. Garnerin, S. Rossato, and L. Besacier, “Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance,” Aug. 2019

  48. [48]

    X-Vectors: Robust DNN Embeddings for Speaker Recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 5329–5333

  49. [49]

    Speaker Group Encoding in Self-supervised Speech Recognition Models,

    F. Herron, S. Rossato, A. Allauzen, B. Favre, and F. Portet, “Speaker Group Encoding in Self-supervised Speech Recognition Models,” in Text, Speech, and Dialogue, K. Ekštein, M. Konopík, O. Pražák, and F. Pártl, Eds. Cham: Springer Nature Switzerland, 2025, pp. 121–132

  50. [50]

    Layer-wise Analysis of a Self-supervised Speech Representation Model,

    A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise Analysis of a Self-supervised Speech Representation Model,” Dec. 2022

  51. [51]

    Mitigating bias against non-native accents,

    Y. Zhang, Y. Zhang, B. Halpern, T. Patel, and O. Scharenborg, “Mitigating bias against non-native accents,” in Proc. Interspeech 2022, 2022, pp. 3168–3172

  52. [52]

    Artie Bias Corpus: An Open Dataset for Detecting Demographic Bias in Speech Applications,

    J. Meyer, L. Rauchenstein, J. D. Eisenberg, and N. Howell, “Artie Bias Corpus: An Open Dataset for Detecting Demographic Bias in Speech Applications,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, ...

  53. [53]

    Racial disparities in automated speech recognition,

    A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel, “Racial disparities in automated speech recognition,” Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, Apr. 2020

  54. [54]

    Don’t speak too fast: The impact of data bias on self-supervised speech models,

    Y . Meng, Y .-H. Chou, A. T. Liu, and H.-y. Lee, “Don’t speak too fast: The impact of data bias on self-supervised speech models,” Apr. 2022