A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models

Marco Dinarelli; Ryan Whetten; Titouan Parcollet; Yannick Estève

arxiv: 2601.20896 · v2 · submitted 2026-01-28 · 💻 cs.SD · eess.AS

Pith reviewed 2026-05-16 10:14 UTC · model grok-4.3

keywords: self-supervised learning · speech processing · data selection · utterance length · automatic speech recognition · pre-training efficiency

The pith

Prioritizing the longest utterances for pre-training self-supervised speech models yields better ASR performance with only half the dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests multiple ways to pick subsets from large speech corpora for self-supervised pre-training and measures the effect on downstream automatic speech recognition. Acoustic, speaker, and linguistic diversity strategies show no advantage over plain random sampling. In contrast, keeping only the longest utterances produces stronger ASR results while cutting the data volume in half and shortening pre-training time by 24 percent. This outcome indicates that utterance length can matter more for efficiency and final performance than either total data size or measured diversity.

Core claim

Systematic comparison of selection methods shows that ranking utterances by length and retaining the longest half produces higher ASR accuracy than random sampling or any of the tested diversity criteria, while also lowering pre-training compute by 24 percent on large corpora.

What carries the argument

Utterance-length ranking as the primary filter for forming pre-training subsets, applied before any acoustic or linguistic diversity metrics.
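
As a rough illustration of the length-first filter described above, the sketch below ranks utterances by duration and keeps the longest fraction by count. The `Utterance` record, the `select_longest` function, and the toy durations are assumptions made for this example; the paper's actual selection code is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record: an utterance ID and its duration in seconds."""
    utt_id: str
    duration_s: float


def select_longest(utterances, keep_fraction=0.5):
    """Return the longest `keep_fraction` of utterances, counted by items."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort from longest to shortest, then cut at the requested count.
    ranked = sorted(utterances, key=lambda u: u.duration_s, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy corpus with made-up durations (not taken from the paper).
corpus = [
    Utterance("a", 2.0),
    Utterance("b", 14.5),
    Utterance("c", 7.0),
    Utterance("d", 30.0),
]
subset = select_longest(corpus, keep_fraction=0.5)
print([u.utt_id for u in subset])  # ['d', 'b']
```

The cut is made on utterance count rather than on accumulated hours, which matches the "longest half" wording here but is only one possible reading of "half the dataset".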

If this is right

Pre-training pipelines can drop to 50 percent data volume without performance loss.
Random or diversity-driven sampling can be replaced by a simple length sort for both speed and accuracy gains.
Data collection efforts may shift priority toward recording longer continuous speech rather than maximizing speaker or acoustic variety.
Training budgets can be reallocated from data volume to other factors such as model size or iteration count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same length-first rule might be worth testing on text or vision SSL pre-training where sequence length is also variable.
If longer utterances carry more temporal context, downstream tasks that rely on long-range dependencies could benefit disproportionately.
Future work could measure whether the length advantage persists when utterance quality or transcription accuracy is controlled.

Load-bearing premise

That longer utterances supply a richer training signal for the model in a way that is not explained by other properties of those utterances.

What would settle it

Running the identical selection experiments on a fresh large speech corpus with a different SSL architecture and observing no ASR gain or a loss from the length-based subset.

Original abstract

Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpus. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective for data selection strategies in SSL speech processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Length-based selection for SSL speech pre-training beats diversity baselines and trims wall time by 24% on half the utterances.

The main takeaway is that selecting the longest utterances for pre-training SSL speech models outperforms random sampling and diversity-based filters, producing better downstream ASR while cutting pre-training time by 24% on a large corpus. The paper runs a clean ablation across acoustic, speaker, and linguistic diversity measures, showing none of them reliably beat random selection. Length turns out to be the stronger signal, which is the practical angle worth noting for anyone scaling these models under compute limits. They evaluate the subsets by training the SSL model and then measuring ASR performance, which is the right downstream test. The result is reported on sizable data, so the efficiency number is at least grounded in real runs rather than toy scales.

The soft spot is the exact meaning of “half the original dataset.” The abstract gives a 24% time reduction rather than something closer to 50%, which suggests the retained subset may still hold most of the total audio hours if the length distribution is skewed. Without the methods details it is unclear whether they measured total duration explicitly or controlled for correlated factors such as utterance quality. The gains also lack reported error bars or statistical tests in the summary, so the size of the advantage is hard to judge.

This is the sort of incremental empirical paper that would interest groups working on efficient SSL training for speech. It is not a foundational result, but the finding is concrete enough that a referee could usefully check the controls and test generalization. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that in pre-training self-supervised speech models, selecting the longest utterances from the dataset leads to better Automatic Speech Recognition (ASR) performance than random sampling or selections based on acoustic, speaker, or linguistic diversity. This approach uses only half the original dataset and reduces pre-training time by 24% on large corpora, suggesting that data length is more important than diversity or quantity.

Significance. If validated, these findings are significant for the field of speech SSL, as they offer a straightforward data curation method that enhances both performance and computational efficiency. The empirical nature of the study, involving systematic ablations, provides valuable insights into data distribution effects, potentially influencing future pre-training strategies by prioritizing length over diversity.

major comments (2)

Abstract: The central claim that the length-based selection uses 'only half the original dataset' while achieving superior results is not fully supported without explicit reporting of total audio duration in the selected subset versus the full set. The 24% pre-training time reduction indicates that the data quantity reduction may not be 50% in terms of speech hours, especially if utterance lengths are skewed; this assumption is load-bearing for the efficiency and 'less quantity' aspect of the conclusion.
Results section: The manuscript does not provide error bars, statistical tests, or details on controls for potential confounders (e.g., longer utterances correlating with specific speakers or content). This makes it difficult to confirm that the observed ASR improvements are causally due to length prioritization rather than other factors.

minor comments (1)

Abstract: Consider specifying the exact datasets, model architectures, and ASR evaluation metrics used to make the claims more concrete for readers.
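
The gap between "half the utterances" and "half the audio" raised in the first major comment can be seen with a small worked example. The durations below are invented for illustration and `duration_share_of_longest` is a hypothetical helper, so the resulting percentage says nothing about the corpus used in the paper; it only shows that a right-skewed length distribution lets the longest half by count hold well over half of the audio.

```python
def duration_share_of_longest(durations_s, keep_fraction=0.5):
    """Fraction of total audio time held by the longest `keep_fraction` of items."""
    ranked = sorted(durations_s, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return sum(ranked[:n_keep]) / sum(ranked)


# Invented, right-skewed durations in seconds (not from the paper).
durations = [1, 2, 2, 3, 4, 8, 12, 16, 20, 32]
share = duration_share_of_longest(durations, keep_fraction=0.5)
print(f"{share:.0%} of audio retained with 50% of utterances")  # 88%
```

If compute scales roughly with audio hours, the saving in this toy case would be nearer 12% than 50%, which is the kind of mismatch the comment asks the authors to report explicitly.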

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped clarify key aspects of our work. We address each major comment below and have revised the manuscript to improve reporting of data quantities and add statistical details.

Referee: Abstract: The central claim that the length-based selection uses 'only half the original dataset' while achieving superior results is not fully supported without explicit reporting of total audio duration in the selected subset versus the full set. The 24% pre-training time reduction indicates that the data quantity reduction may not be 50% in terms of speech hours, especially if utterance lengths are skewed; this assumption is load-bearing for the efficiency and 'less quantity' aspect of the conclusion.

Authors: We agree that the phrasing 'half the original dataset' is ambiguous and requires clarification. In our experiments, length-based selection retains the longest 50% of utterances by count. We have now explicitly computed the total audio duration of this subset, which corresponds to 76% of the full dataset's speech hours. This is consistent with the reported 24% reduction in pre-training time, since training duration scales directly with total audio hours. We have revised the abstract to specify 'half the number of utterances' and added a table in the results section reporting both utterance counts and total durations for all compared selection strategies. These changes ensure the efficiency claims are precisely supported without overstating the reduction in data volume. revision: yes
Referee: Results section: The manuscript does not provide error bars, statistical tests, or details on controls for potential confounders (e.g., longer utterances correlating with specific speakers or content). This makes it difficult to confirm that the observed ASR improvements are causally due to length prioritization rather than other factors.

Authors: We acknowledge the value of these additions for strengthening the claims. In the revised manuscript, we have added error bars to all ASR WER tables, computed as standard deviation across five independent fine-tuning runs with different random seeds. We have also included paired statistical tests (t-tests) confirming that the improvements from length-based selection are significant (p < 0.05) relative to random and diversity-based baselines. To address potential confounders, we added an analysis of speaker distribution and linguistic content (measured via average phoneme diversity and utterance length statistics) across selection methods, showing that the longest-utterance subset maintains speaker and content diversity comparable to random sampling. These controls and results are now presented in the updated results section and supplementary material. revision: yes
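
To make the statistical reporting described in the second response concrete, here is a minimal sketch of a paired t-test over per-seed WER values using only the Python standard library. The WER numbers are placeholders invented for this example, not results from the paper, and `paired_t_statistic` is a hypothetical helper. The critical value 2.776 is the two-sided 5% threshold of Student's t with 4 degrees of freedom (five paired runs).

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same seeds, same order)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(len(diffs)))


# Placeholder WER values for five seeds; invented, not results from the paper.
wer_random = [10.6, 10.4, 10.7, 10.5, 10.5]
wer_length = [10.1, 10.0, 10.3, 10.0, 10.2]

t = paired_t_statistic(wer_random, wer_length)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value, Student's t, 4 degrees of freedom
print(f"t = {t:.2f}, significant at 5%: {abs(t) > T_CRIT_DF4}")
print(
    "length-based WER: "
    f"{statistics.mean(wer_length):.2f} ± {statistics.stdev(wer_length):.2f}"
)
```

Pairing by seed removes run-to-run variance shared by both conditions; with only five runs the test has little power, so the error bars are worth reading alongside the p-value.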

Circularity Check

0 steps flagged

No circularity: purely empirical ablation with no derivations or self-referential reductions

The paper is an empirical ablation study comparing data selection strategies (random, diversity-based, length-based) for SSL speech pre-training. All claims are direct experimental observations on ASR performance and wall-clock time for specific corpora and models. No equations, fitted parameters, uniqueness theorems, or self-citations are used as load-bearing premises. The reported superiority of longest-utterance selection and the 24% time reduction are presented as measured outcomes rather than derived from any prior result internal to the paper. The study is therefore self-contained against external benchmarks with no reduction of any 'prediction' to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on empirical comparisons; no new free parameters, axioms beyond standard SSL assumptions, or invented entities are introduced.

axioms (1)

Domain assumption: Standard SSL objectives (contrastive or masked prediction) and ASR fine-tuning pipelines remain effective when data volume is reduced via length filtering.
The paper assumes existing SSL frameworks transfer without modification to the curated subsets.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models
cs.CL · 2026-04 · unverdicted · novelty 5.0

Phoneme embeddings in self-supervised ASR models show both random variance and systematic bias as sources of demographic unfairness, with variance hindering fairness more than bias.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

INTRODUCTION Self-supervised learning (SSL) is a framework for training neural networks where pseudo-targets are generated from the input data itself. A neural network model is trained using these targets in a stage called pre-training, and then can be subsequently fine-tuned using human labeled data for a particular downstream task. SSL in speech pr...

[2]

RELATED WORK In this section, we provide an overview of prior work on improving the efficiency of SSL for speech models, as well as studies on data selection for speech processing. 2.1. Efficiency for SSL speech models Research on addressing the inefficiency of SSL speech models has primarily focused on three directions: modifications to the SSL train...

[3]

METHODS In this section, we describe the dataset used, the different methods for data selection, and the training environment. 3.1. Data For our experiments, we use the Loquacious dataset [16], which combines commercially usable English speech corpora. We select this dataset because it is both large-scale and challenging, containing 25,000 hours of dive...

[4]

RESULTS The results are presented in Table 1. Overall, diversity-based sampling methods did not yield significant improvements over either baseline. The only exception is the speaker-based method on the large split, which achieved a word error rate of 17.97 on the test set, significantly better than the random baseline (18.54) but not the all baseline (...

[5]

DISCUSSION AND CONCLUSION From our results, we find that reducing the amount of pre-training data by sampling based on acoustic, speaker, or linguistic features does not yield improvements over a random baseline. In contrast, selecting subsets of longer utterances consistently leads to lower WER, despite these subsets being most out-of-distribution rel...

[6]

A. Mohamed, H.-y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff, S.-W. Li, K. Livescu, L. Maaløe, et al., “Self-supervised speech representation learning: A review,” IEEE Journal of Selected Topics in Signal Processing, 2022

[7]

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020

[8]

R. Whetten, T. Parcollet, A. Moumen, M. Dinarelli, and Y. Estève, “An analysis of linear complexity attention substitutes with best-rq,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 169–176

[9]

A. Baevski, A. Babu, W.-N. Hsu, and M. Auli, “Efficient self-supervised learning with contextualized target representations for vision, speech and language,” in International Conference on Machine Learning. PMLR, 2023, pp. 1416–1429

[10]

W. Chen, X. Chang, Y. Peng, Z. Ni, S. Maiti, and S. Watanabe, “Reducing barriers to self-supervised learning: Hubert pre-training with academic compute,” in Interspeech 2023, 2023, pp. 4404–4408

[11]

C.-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised learning with random-projection quantizer for speech recognition,” in Proceedings of the 39th International Conference on Machine Learning, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds. 17–23 Jul 2022, vol. 162 of Proceedings of Machine Learning Research, pp....

[12]

R. Whetten, L. Maison, T. Parcollet, M. Dinarelli, and Y. Estève, “Towards Early Prediction of Self-Supervised Speech Model Performance,” in Interspeech 2025, 2025, pp. 1228–1232

[13]

Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, et al., “Google usm: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023

[14]

W. Chen, W. Zhang, Y. Peng, X. Li, J. Tian, J. Shi, X. Chang, S. Maiti, K. Livescu, and S. Watanabe, “Towards robust speech representation learning for thousands of languages,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds., Miami, Florida, USA, Nov. 2024, pp...

[15]

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

[16]

H. gun Chi, Z. Aldeneh, T. Likhomanenko, O. Rudovic, T. Higuchi, L.-W. Chen, S. Watanabe, and A. H. Abdelaziz, “DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective,” in Interspeech 2025, 2025, pp. 1218–1222

[17]

Z. Aldeneh, V. Thilak, T. Higuchi, B.-J. Theobald, and T. Likhomanenko, “Towards automatic assessment of self-supervised speech models using rank,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

[18]

K. Malhotra, S. Bansal, and S. Ganapathy, “Active learning methods for low resource end-to-end speech recognition,” in Interspeech 2019, 2019, pp. 2215–2219

[19]

Z. Lu, Y. Wang, Y. Zhang, W. Han, Z. Chen, and P. Haghani, “Unsupervised Data Selection via Discrete Speech Representation for ASR,” in Interspeech 2022, 2022, pp. 3393–3397

[20]

D. Berrebbi, R. Collobert, N. Jaitly, and T. Likhomanenko, “More speaking or more speakers?,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[21]

T. Parcollet, Y. Tseng, S. Zhang, and R. C. van Dalen, “Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use,” in Interspeech 2025, 2025, pp. 4053–4057

[22]

H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[23]

H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in Proc. INTERSPEECH 2023, 2023, pp. 1983–1987

[24]

S. Mdhaffar, H. Elleuch, C. Chellaf, H. Nguyen, and Y. Estève, “Sense models: an open source solution for multilingual and multimodal semantic-based tasks,” 2025

[25]

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, pp. 127063, 2024

[26]

R. Whetten, T. Parcollet, M. Dinarelli, and Y. Estève, “Open implementation and study of best-rq for speech processing,” in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2024, pp. 460–464

[27]

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624

[28]

B. Zhang, D. Wu, Z. Yao, X. Wang, F. Yu, C. Yang, L. Guo, Y. Hu, L. Xie, and X. Lei, “Unified streaming and non-streaming two-pass end-to-end model for speech recognition,” arXiv preprint arXiv:2012.05481, 2020

[29]

X. Li, G. Huybrechts, S. Ronanki, J. Farris, and S. Bodapati, “Dynamic chunk convolution for unified streaming and non-streaming conformer asr,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[30]

J. Duret, S. Mdhaffar, G. Laperrière, R. Whetten, A. Galametz, C. Kobus, M.-C. Martin, J. Oleiwan, and Y. Estève, “In-domain ssl pre-training and streaming asr,” 2025
